MPI-standard-compliant management of eager sends requires that this program work. Nothing in the standard says "unless the eager limit is set too high or too low." Honoring this requirement in an MPI implementation can be costly, and there are practical reasons to skip it because most applications do not need it.
I would like to see the MPI Forum find a way to relax this requirement, and I have made a proposal that would do so without invalidating any current MPI program.
I would consider simply removing this requirement if the MPI Forum decides that it is OK for some valid MPI 2.2 programs to be invalid MPI 3.0 programs, but I hope the Forum does not go in the direction of breaking existing valid MPI programs.
Ashley says below: "If the MPI_SEND isn't blocking then each rank will send 50 messages to rank zero and you'll have 2000 messages ...."
What the standard says is that MPI_SEND must block before there are more messages at the destination than it can manage.
I do not think that ignoring the standard's requirement that this program work is a very good solution.
Here is what the standard says:
Section 3.4 MPI 2.2 page 39:1..7
The send call described in Section 3.2.1 uses the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.
Section 3.5 MPI 2.2 page 44:8..19
A buffered send operation that cannot complete because of a lack of buffer space is erroneous. When such a situation is detected, an error is signalled that may cause the program to terminate abnormally. On the other hand, a standard send operation that cannot complete because of lack of buffer space will merely block, waiting for buffer space to become available or for a matching receive to be posted. This behavior is preferable in many situations. Consider a situation where a producer repeatedly produces new values and sends them to a consumer. Assume that the producer produces new values faster than the consumer can consume them. If buffered sends are used, then a buffer overflow will result. Additional synchronization has to be added to the program so as to prevent this from occurring. If standard sends are used, then the producer will be automatically throttled, as its send operations will block when buffer space is unavailable.
Note - in the paragraph above "buffered send" means MPI_BSEND, not eager send.
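To make the distinction in the quoted paragraph concrete, here is a small model of the two behaviors. It is only a sketch and it is not MPI code: a bounded Python queue stands in for the buffer space at the destination, and the names BUFFER_SLOTS, MESSAGES, run_standard_mode and run_buffered_mode are made up for illustration. The standard-mode model blocks the producer when the buffer is full; the buffered-mode model treats a full buffer as an error, and for simplicity assumes the consumer never drains anything.

```python
import queue
import threading
import time

BUFFER_SLOTS = 4   # hypothetical: messages the destination can hold at once
MESSAGES = 20      # hypothetical: values the producer wants to send


def run_standard_mode():
    """Model of a standard-mode send: block while no buffer space is free."""
    buf = queue.Queue(maxsize=BUFFER_SLOTS)
    received = []

    def consumer():
        for _ in range(MESSAGES):
            time.sleep(0.001)            # consumer is slower than the producer
            received.append(buf.get())   # frees one slot per message consumed

    worker = threading.Thread(target=consumer)
    worker.start()
    for value in range(MESSAGES):
        buf.put(value)                   # blocks here whenever the buffer is full
    worker.join()
    return received


def run_buffered_mode():
    """Model of a buffered-mode send: running out of buffer space is an error.

    Assumption for the sketch: nothing is consumed while the producer runs.
    """
    buf = queue.Queue(maxsize=BUFFER_SLOTS)
    sent = 0
    try:
        for value in range(MESSAGES):
            buf.put_nowait(value)        # raises queue.Full instead of waiting
            sent += 1
    except queue.Full:
        pass                             # stands in for the erroneous overflow
    return sent


if __name__ == "__main__":
    print(run_standard_mode() == list(range(MESSAGES)))
    print(run_buffered_mode())
```

In the first model every message arrives in order because the producer is throttled automatically. In the second, the producer gets only as far as the buffer size before the overflow is detected.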
Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
firstname.lastname@example.org wrote on 12/03/2009 05:33:51 AM:
> Re: [OMPI users] Program deadlocks, on simple send/recv loop
> Ashley Pittman
> Open MPI Users
> 12/03/2009 05:35 AM
> Please respond to Open MPI Users
> On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
> > On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> > > On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> > >> The attached code is an example where openmpi/1.3.2 will lock up if
> > >> run on 48 cores of IB (4 cores per node).
> > >> The code loops over recv from all processors on rank 0 and sends from
> > >> all other ranks, as far as I know this should work, and I can't see
> > >> why not.
> > >> Note yes I know we can do the same thing with a gather, this is a
> > >> simple case to demonstrate the issue.
> > >> Note that if I increase the openib eager limit, the program runs,
> > >> which normally means improper MPI, but I can't on my own figure out
> > >> the problem with this code.
> > >
> > > What are you increasing the eager limit from and to?
> > The same value as ethernet on our system,
> > mpirun --mca btl_openib_eager_limit 655360 --mca
> > btl_openib_max_send_size 655360 ./a.out
> > Huge values compared to the defaults, but works,
> My understanding of the code is that each message will be 256k long and
> the code pretty much guarantees that at some point there will be 46
> messages in the queue in front of the one you are looking to receive
> which makes a total of 11.5 MB, slightly less if you take shared memory
> into account.
> If the MPI_SEND isn't blocking then each rank will send 50 messages to
> rank zero and you'll have 2000 messages and 500 MB of data being received
> with the message you want being somewhere towards the end of the queue.
> These numbers are far from huge but then compared to an eager limit of
> 64k they aren't small either.
> I suspect the eager limit is being reached on COMM_WORLD rank 0 and it's
> not pulling any more messages off the network pending some of the
> existing ones being out of the queue but they never will be because the
> message being waited for is one that's stuck on the network. As I say
> the message queue for rank 0 when it's deadlocked would be interesting
> to look at.
> In summary this code makes heavy use of unexpected messages and network
> buffering, it's not surprising to me that it only works with eager
> limits set fairly high.
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing
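The sizes quoted in Ashley's message can be checked with a few lines of arithmetic. This sketch assumes that "256k" and "64k" mean KiB (1024 bytes) and that the totals are in MiB; the helper name queued_mib is made up for illustration. The message counts (46 and 2000) are taken from the quoted text as given, not derived from the attached program.

```python
KIB = 1024
MIB = 1024 * KIB

MESSAGE_BYTES = 256 * KIB        # "each message will be 256k long" (assumed KiB)
DEFAULT_EAGER_BYTES = 64 * KIB   # "an eager limit of 64k" (assumed KiB)


def queued_mib(message_count, message_bytes=MESSAGE_BYTES):
    """Total size in MiB of message_count unexpected messages."""
    return message_count * message_bytes / MIB


print(queued_mib(46))                          # 11.5
print(queued_mib(2000))                        # 500.0
print(MESSAGE_BYTES // DEFAULT_EAGER_BYTES)    # 4
```

So each message is four times the 64k eager limit mentioned in the thread, and the 11.5 and 500 figures are consistent with 256k messages.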