Ashley Pittman wrote:
>On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
>>On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
>>>On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
>>>>The attached code is an example where openmpi/1.3.2 will lock up if
>>>>run on 48 cores over IB (4 cores per node).
>>>>The code loops over recv from all processors on rank 0 and sends from
>>>>all other ranks, as far as I know this should work, and I can't see
>>>>Note yes I know we can do the same thing with a gather, this is a
>>>>simple case to demonstrate the issue.
>>>>Note that if I increase the openib eager limit, the program runs,
>>>>which normally means improper MPI, but I can't on my own figure out
>>>>the problem with this code.
>>>What are you increasing the eager limit from and to?
>>The same value as ethernet on our system,
>>mpirun --mca btl_openib_eager_limit 655360 --mca
>>btl_openib_max_send_size 655360 ./a.out
>>Huge values compared to the defaults, but works,
>My understanding of the code is that each message will be 256k long
Yes. Brock's Fortran code has each nonzero rank send 50 messages, each
256K, via standard send to rank 0. Rank 0 posts standard receives for
all of them, pulling in all 50 messages from rank 1 in order, then all
50 from rank 2, etc.
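To make that traffic pattern concrete, here is a small Python model of
what I understand the Fortran program to do. It is only a sketch and
makes no MPI calls. The 48 ranks come from Brock's report, reading
"256K" as 256*1024 bytes is my assumption, and the function names are
made up for illustration.

```python
# Hypothetical model of the send/receive pattern; no MPI calls are made.
NRANKS = 48              # from Brock's report (48 cores)
MSGS_PER_RANK = 50       # messages sent by each nonzero rank
MSG_BYTES = 256 * 1024   # assumption: "256K" read as 256 KiB


def receive_order(nranks, msgs_per_rank):
    """Order in which rank 0 posts receives: all of rank 1, then rank 2, ..."""
    return [(src, i) for src in range(1, nranks) for i in range(msgs_per_rank)]


def not_yet_posted(nranks, msgs_per_rank, done):
    """Messages with no receive posted yet after rank 0 completed `done` receives."""
    return (nranks - 1) * msgs_per_rank - done


order = receive_order(NRANKS, MSGS_PER_RANK)
total_bytes = len(order) * MSG_BYTES
print(len(order), "receives,", total_bytes, "bytes in total")
print("first:", order[0], "last:", order[-1])
print("unposted once rank 1 is drained:",
      not_yet_posted(NRANKS, MSGS_PER_RANK, MSGS_PER_RANK))
```

While rank 0 is still working through rank 1, every other rank has
already started its standard sends, and none of the matching receives
has been posted yet. Whether those sends can complete in the meantime
depends on how much the implementation is willing to buffer, which is
presumably why the eager limit changes the behavior.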
John Cary sent out a C++ code on this same e-mail thread. It sends
256*8=2048-byte messages. Each nonzero rank sends one message and rank 0
receives them in rank order. Then there is a barrier. The program
iterates on this pattern.
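The same kind of Python model for the C++ pattern, again only a sketch
with no MPI calls. The message size is the 2048 bytes stated above. The
rank and iteration counts passed in are hypothetical (48 is borrowed
from Brock's run, not from John's), and the function names are made up.

```python
# Hypothetical model of the C++ pattern; no MPI calls are made.
MSG_BYTES = 256 * 8   # 2048 bytes, as stated above


def max_outstanding(nranks, msg_bytes=MSG_BYTES):
    """Upper bound on (messages, bytes) sent but not yet received by rank 0.

    Each nonzero rank sends one message per iteration, and the barrier
    cannot complete until rank 0 has finished its receives, so no rank
    can start the next iteration's send early.
    """
    senders = nranks - 1
    return senders, senders * msg_bytes


def schedule(nranks, iterations):
    """Events on rank 0: receives in rank order, then a barrier, per iteration."""
    events = []
    for it in range(iterations):
        events += [("recv", it, src) for src in range(1, nranks)]
        events.append(("barrier", it))
    return events


print(max_outstanding(48))   # 48 ranks is an assumption, not from John's post
print(schedule(3, 2))
```

Because of the barrier, at most one small message per sender is ever
outstanding here, whereas the Fortran pattern lets every sender run up
to 50 large messages ahead of rank 0.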
I can imagine that the two programs illustrate two different problems.