
Subject: Re: [OMPI users] Program deadlocks, on simple send/recv loop
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-12-03 11:19:43


Ashley Pittman wrote:

>On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
>
>
>>On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
>>
>>
>>>On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
>>>
>>>
>>>>The attached code is an example where openmpi/1.3.2 will lock up if
>>>>run on 48 cores over IB (4 cores per node).
>>>>The code loops over recvs from all processors on rank 0 and sends from
>>>>all other ranks; as far as I know this should work, and I can't see
>>>>why it wouldn't.
>>>>Note that yes, I know we could do the same thing with a gather; this
>>>>is a simple case to demonstrate the issue.
>>>>Note that if I increase the openib eager limit, the program runs,
>>>>which normally points to improper MPI usage, but I can't figure out
>>>>the problem with this code on my own.
>>>>
>>>>
>>>What are you increasing the eager limit from and to?
>>>
>>>
>>The same values we use for ethernet on our system:
>>mpirun --mca btl_openib_eager_limit 655360 --mca
>>btl_openib_max_send_size 655360 ./a.out
>>
>>Huge values compared to the defaults, but it works.
>>
>>
>My understanding of the code is that each message will be 256k long
>
Yes. Brock's Fortran code has each nonzero rank send 50 messages, each
256K, via standard send to rank 0. Rank 0 uses standard receives on
them all, pulling in all 50 messages in order from rank 1, then from
rank 2, etc.
http://www.open-mpi.org/community/lists/users/2009/12/11311.php
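
For concreteness, here is a minimal C++ sketch of that pattern. It is not
Brock's actual code (his is Fortran); the names, the double datatype, and
the assumption that "256K" means 256 KB per message are only illustrative.

// Sketch of the described pattern: each nonzero rank does 50 standard
// (blocking) sends of ~256 KB to rank 0; rank 0 receives all 50 messages
// from rank 1, then all 50 from rank 2, and so on.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nmsgs = 50;
    const int count = 256 * 1024 / sizeof(double);   // 256 KB of doubles
    std::vector<double> buf(count);

    if (rank == 0) {
        // Receive in strict order: all messages from rank 1, then rank 2, ...
        for (int src = 1; src < size; ++src)
            for (int i = 0; i < nmsgs; ++i)
                MPI_Recv(buf.data(), count, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        // Standard sends to rank 0; these may block once the eager limit
        // is exceeded and rank 0 has not yet posted the matching receive.
        for (int i = 0; i < nmsgs; ++i)
            MPI_Send(buf.data(), count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}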

John Cary sent out a C++ program on this same e-mail thread. It sends
256*8 = 2048-byte messages. Each nonzero rank sends 1 message and rank 0
pulls them in in rank order. Then there is a barrier. The program
iterates on this pattern.
http://www.open-mpi.org/community/lists/users/2009/12/11348.php
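
Again for concreteness, a minimal C++ sketch of that pattern (not Cary's
actual code; the iteration count and the use of doubles are only
illustrative assumptions):

// Sketch of the described pattern: each nonzero rank sends one 2048-byte
// message (256 doubles) to rank 0; rank 0 receives them in rank order;
// then everyone hits a barrier, and the whole sweep repeats.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> buf(256);           // 256 * 8 = 2048 bytes
    const int niters = 1000;                // iteration count is illustrative

    for (int iter = 0; iter < niters; ++iter) {
        if (rank == 0) {
            for (int src = 1; src < size; ++src)   // receive in rank order
                MPI_Recv(buf.data(), 256, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Send(buf.data(), 256, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);        // barrier after each sweep
    }

    MPI_Finalize();
    return 0;
}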

I can imagine the two programs are illustrating two different problems.