On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
>> The attached code is an example where openmpi/1.3.2 will lock up if
>> run on 48 cores over IB (4 cores per node).
>> The code loops over recv from all processors on rank 0 and sends from
>> all other ranks. As far as I know this should work, and I can't see
>> why not.
>> Note: yes, I know we can do the same thing with a gather; this is a
>> simple case to demonstrate the issue.
>> Note that if I increase the openib eager limit, the program runs,
>> which normally means improper MPI, but I can't on my own figure out
>> the problem with this code.
> What are you increasing the eager limit from and to?
The same value as ethernet on our system:
mpirun --mca btl_openib_eager_limit 655360 --mca btl_openib_max_send_size 655360 ./a.out
Huge values compared to the defaults, but it works.
> There is a
> moderate amount of data flowing, and as the receives are made
> synchronously and in order, it could be that there are several
> thousand unexpected messages arriving before the one you are looking
> for, which will lead to long receive queues and a need to buffer lots of
> data.
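To make the unexpected-message point concrete, here is a toy model in Python. It is only a sketch: it is not the attached program and says nothing about Open MPI internals. It assumes rank 0 posts blocking receives in the fixed order 1, 2, ..., N, that every sender transmits eagerly, and that arrivals are randomly interleaved. The function names and the figure of 100 messages per sender are made up for illustration.

```python
import random
from collections import Counter

def peak_unexpected(arrivals, num_senders):
    """Return the peak count of buffered (unexpected) messages at rank 0.

    arrivals: sender ranks (1..num_senders) in the order their messages arrive.
    Rank 0 posts blocking receives in the fixed order 1, 2, ..., num_senders,
    repeating, which mirrors receives made synchronously and in order.
    """
    buffered = Counter()   # unexpected messages held per sender
    total = 0              # total unexpected messages currently held
    peak = 0
    want = 1               # sender the current blocking receive is waiting on
    for src in arrivals:
        buffered[src] += 1
        total += 1
        # Complete every receive that the buffer can now satisfy.
        while buffered[want] > 0:
            buffered[want] -= 1
            total -= 1
            want = want % num_senders + 1
        peak = max(peak, total)
    return peak

def shuffled_arrivals(num_senders, msgs_per_sender, seed=0):
    """Assumed arrival pattern: all senders transmit eagerly, order is random."""
    arrivals = [s for s in range(1, num_senders + 1)
                for _ in range(msgs_per_sender)]
    random.Random(seed).shuffle(arrivals)
    return arrivals

if __name__ == "__main__":
    # 47 senders matches the 48-core run; 100 messages per sender is made up.
    print(peak_unexpected(shuffled_arrivals(47, 100), 47))
```

In this model, messages that arrive in exactly the order rank 0 asks for them never sit in the buffer, while any other interleaving forces rank 0 to hold messages until their turn comes up.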
>> Any input on why code like this locks up?
> If you ran padb against this code when it had locked up, you should be
> able to get some more information, in particular the message queues
> for rank zero. Hopefully this information would be useful.
> Ashley Pittman.
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing