I found what caused the problem in both cases.
--- ompi/mca/btl/sm/btl_sm.c (revision 18675)
+++ ompi/mca/btl/sm/btl_sm.c (working copy)
@@ -812,7 +812,7 @@
endpoint->peer_smp_rank, frag->hdr, false, rc);
- return (rc < 0 ? rc : 1);
+ return OMPI_SUCCESS;
I am just not sure if this fix is correct.
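For context, here is a minimal standalone sketch of what the two return conventions do differently (this is not Open MPI code; OMPI_SUCCESS == 0 and the negative error value are assumptions based on the diff above):

/* sketch: old vs. patched return convention of the send path */
#include <stdio.h>

#define OMPI_SUCCESS 0             /* assumed to be 0, as in Open MPI */
#define ERR_OUT_OF_RESOURCE (-2)   /* hypothetical negative error code */

/* old behavior: propagate errors, otherwise return 1 (commonly meaning
 * "fragment completed inline" in BTL send conventions) */
static int send_old(int rc) { return (rc < 0 ? rc : 1); }

/* patched behavior: always report success; completion must then be
 * signaled later (e.g. via a callback), and a negative rc is swallowed */
static int send_new(int rc) { (void)rc; return OMPI_SUCCESS; }

int main(void) {
    int rcs[] = { ERR_OUT_OF_RESOURCE, 0, 1 };
    for (int i = 0; i < 3; i++)
        printf("rc=%2d  old=%2d  new=%2d\n",
               rcs[i], send_old(rcs[i]), send_new(rcs[i]));
    return 0;
}

Note how the patched version also turns a negative rc into success, which is part of why I am unsure whether callers can live without the error code or the "1".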
On Wed, Jun 18, 2008 at 3:21 PM, Lenny Verkhovsky <[hidden]> wrote:
> I am not sure if it's related,
> but I applied your patch (r18667) to r18656 (the one before NUMA)
> together with disabling sendi,
> and the result is still the same (hanging).
> On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosilca_at_[hidden]> wrote:
>> I guess you're running the latest version. If not, please update; Galen
>> and I corrected some bugs last week. If you're using the latest (and
>> greatest) then ... well, I imagine there is at least one bug left.
>> There is a quick test you can do. In btl_sm.c, in the module structure
>> at the beginning of the file, please replace the sendi function with NULL. If
>> this fixes the problem, then at least we know that it's an sm send immediate
>> problem.
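>> A minimal sketch of the idea behind this test, using hypothetical types
>> rather than the real OMPI structs (the actual field is the sendi entry in
>> the module structure at the top of btl_sm.c):
>>
>> #include <stdio.h>
>> #include <stddef.h>
>>
>> typedef int (*sendi_fn)(const char *msg);
>>
>> struct btl_module {
>>     sendi_fn btl_sendi;   /* fast "send immediate" path; may be NULL */
>> };
>>
>> static int my_sendi(const char *msg) {
>>     printf("sendi: %s\n", msg);
>>     return 0;
>> }
>>
>> static void do_send(struct btl_module *m, const char *msg) {
>>     if (m->btl_sendi != NULL && m->btl_sendi(msg) == 0)
>>         return;                          /* fast path succeeded */
>>     printf("fallback send: %s\n", msg);  /* regular send path */
>> }
>>
>> int main(void) {
>>     struct btl_module m = { my_sendi };
>>     do_send(&m, "hello");  /* uses the sendi fast path */
>>     m.btl_sendi = NULL;    /* the suggested test: disable send immediate */
>>     do_send(&m, "hello");  /* falls back to the regular path */
>>     return 0;
>> }
>>
>> If the hang disappears with the pointer set to NULL, the send immediate
>> path is the suspect.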
>> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>>> Hi, George,
>>> I have a problem running a BW benchmark on a 100-rank cluster after r18551.
>>> The BW test is mpi_p, which runs mpi_bandwidth with a 100K message size between all pairs.
>>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18549 -t bw -s 100000
>>> BW (100) (size min max avg) 100000 576.734030 2001.882416
>>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>>> mpirun: killing job...
>>> (it hangs even after 10 hours).
>>> It doesn't happen if I run with --bynode, or with the openib,self BTLs only.
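>>> For comparison, the runs that do not hang are presumably of this form
>>> (standard Open MPI options; same binary as above):
>>> #mpirun -np 100 -hostfile hostfile_w --bynode ./mpi_p_18551 -t bw -s 100000
>>> #mpirun -np 100 -hostfile hostfile_w --mca btl openib,self ./mpi_p_18551 -t bw -s 100000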