On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> The attached code is an example where openmpi/1.3.2 will lock up if
> run on 48 cores over IB (4 cores per node).
> The code loops over recv from all processors on rank 0 and sends from
> all other ranks. As far as I know this should work, and I can't see
> why it doesn't.
> Note: yes, I know we can do the same thing with a gather; this is a
> simple case to demonstrate the issue.
> Note that if I increase the openib eager limit, the program runs,
> which normally means improper MPI, but I can't on my own figure out
> the problem with this code.
What are you increasing the eager limit from and to? There is a
moderate amount of data flowing, and as the receives are made
synchronously and in order it could be that there are several
thousand unexpected messages arriving before the one you are looking for
which will lead to long receive queues and a need to buffer lots of
> Any input on why code like this locks up.
If you ran padb against this code when it had locked up you should be
able to get some more information, in particular the message queues for
rank zero. Hopefully this information would be useful.
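
To make the unexpected-message concern above concrete, here is a rough toy
model in Python. It is not the attached code and not how Open MPI is
implemented: it only assumes that rank 0 posts blocking receives in strict
(iteration, source) order while the other ranks send eagerly and their
messages arrive in a random interleaving. The function name
max_unexpected_depth and its parameters are made up for illustration.

```python
import random
from collections import deque


def max_unexpected_depth(num_ranks, msgs_per_rank, seed=0):
    """Return the peak number of messages rank 0 must buffer in this toy model.

    Assumptions (hypothetical, for illustration only):
      * ranks 1..num_ranks-1 each send msgs_per_rank eager messages to rank 0;
      * messages from one sender arrive in order, but senders interleave randomly;
      * rank 0 receives in order: for each iteration, from source 1, 2, 3, ...
    """
    rng = random.Random(seed)

    # Build a random arrival order that keeps each sender's messages in order.
    remaining = {src: deque(range(msgs_per_rank)) for src in range(1, num_ranks)}
    arrivals = []
    while remaining:
        src = rng.choice(list(remaining))
        arrivals.append((src, remaining[src].popleft()))
        if not remaining[src]:
            del remaining[src]

    # Receives are posted one at a time, in strict (iteration, source) order.
    wanted = [(src, it) for it in range(msgs_per_rank)
              for src in range(1, num_ranks)]

    unexpected = set()  # arrived messages with no matching receive posted yet
    max_depth = 0
    arrival_iter = iter(arrivals)
    for want in wanted:
        if want in unexpected:
            # Already buffered: the receive completes from the unexpected queue.
            unexpected.remove(want)
            continue
        # Otherwise keep accepting arrivals until the wanted message shows up;
        # everything that arrives first has to be buffered.
        for msg in arrival_iter:
            if msg == want:
                break
            unexpected.add(msg)
            max_depth = max(max_depth, len(unexpected))
    return max_depth


if __name__ == "__main__":
    # 48 ranks as in the report; 100 messages per sender is an arbitrary choice.
    print(max_unexpected_depth(num_ranks=48, msgs_per_rank=100))
```

With a single sender the queue stays empty, because arrivals match the
receive order exactly. With many senders running ahead of the receiver the
buffered count grows with both the number of ranks and the number of
messages, which is the kind of pressure a larger eager limit could be masking.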
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing