Is what George says accurate? If so, it sounds to me like OpenMPI does not comply with the MPI standard on the behavior of eager protocol. MPICH is getting dinged in this discussion because they have complied with the requirements of the MPI standard. IBM MPI also complies with the standard.
If there is any debate about whether the MPI standard does (or should) require the behavior I describe below then we should move the discussion to the MPI 2.1 Forum and get a clarification.
To me, the MPI standard is clear that a program like this:
task 0:
    start receiving messages
each of tasks 1 to n-1:
    loop 5000 times
        MPI_Send(small message to 0)
may send some small messages eagerly if there is space at task 0, but must block each of tasks 1 to n-1 before allowing task 0 to run out of eager buffer space. Doing this requires a token or credit management system in which each task holds credits for known buffer space at task 0. Each task sends eagerly to task 0 until it runs out of credits and then must switch to the rendezvous protocol. Tasks 1 to n-1 might each do 3 MPI_Sends or 300 MPI_Sends before blocking, depending on how much buffer space there is at task 0, but they would need to block in some MPI_Send before task 0 blows up.
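To make the sender side of that credit scheme concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not how any MPI library implements it: the class name `CreditSender`, its fields, and the "one credit equals one small message" unit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CreditSender:
    """Hypothetical sender-side view of eager buffer space reserved at task 0."""
    credits: int              # eager slots this sender may still use (assumed: 1 per message)
    eager_sent: int = 0
    rendezvous_sent: int = 0

    def send(self) -> str:
        """Choose the protocol for one small message."""
        if self.credits > 0:
            # Known buffer space exists at the receiver: send eagerly.
            self.credits -= 1
            self.eager_sent += 1
            return "eager"
        # No known buffer space left: fall back to rendezvous, which
        # blocks until the receiver is ready to accept the message.
        self.rendezvous_sent += 1
        return "rendezvous"

    def credit_returned(self, n: int = 1) -> None:
        """Called when the receiver reports that it has freed n eager slots."""
        self.credits += n


sender = CreditSender(credits=3)
protocols = [sender.send() for _ in range(5)]
print(protocols)
```

With three credits, the first three sends go out eagerly and the remaining two fall back to rendezvous, so the sender can never consume more of task 0's buffer than it was granted.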
When task 0 wakes up and begins receiving the early arrivals, tasks 1 to n-1 will unblock and resume looping.
Allowing the user to shut off the eager protocol by setting the eager size to 0 does not fix the standards compliance issue. You must either have no eager protocol at all or have an eager message token/credit strategy.
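The difference between the two strategies can be illustrated with a small step-based simulation. This is a rough sketch under stated assumptions, not a model of any real MPI implementation: the function `run`, its parameters, and the one-message-per-step timing are invented for the example, and a sender that has no credit simply waits (standing in for blocking in MPI_Send) until a credit comes back.

```python
from collections import deque


def run(num_senders, msgs_per_sender, credits_per_sender, receiver_wakes_at):
    """Return (peak messages buffered at task 0, messages delivered).

    credits_per_sender=None models eager sends with no flow control.
    """
    remaining = [msgs_per_sender] * num_senders
    credits = [credits_per_sender] * num_senders
    buffered = deque()        # early arrivals held at task 0 (sender ids)
    total = num_senders * msgs_per_sender
    peak = delivered = step = 0
    while delivered < total:
        for s in range(num_senders):
            if remaining[s] == 0:
                continue
            if credits[s] is None or credits[s] > 0:
                buffered.append(s)            # eager send lands in task 0's buffer
                remaining[s] -= 1
                if credits[s] is not None:
                    credits[s] -= 1
            # else: the sender is blocked until a credit is returned
        peak = max(peak, len(buffered))
        if step >= receiver_wakes_at:
            # Task 0 is awake: drain early arrivals and hand credits back.
            while buffered:
                s = buffered.popleft()
                delivered += 1
                if credits[s] is not None:
                    credits[s] += 1
        step += 1
    return peak, delivered


unbounded = run(4, 5000, None, 6000)   # no flow control
bounded = run(4, 5000, 8, 6000)        # 8 credits per sender
print(unbounded, bounded)
```

In this toy model every message is delivered either way, but without credits task 0 must hold all 20000 early arrivals at once, whereas with 8 credits per sender it never holds more than 32.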
Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
email@example.com wrote on 02/03/2008 06:59:38 PM:
> Well ... this is exactly the kind of behavior a high performance
> application tries to achieve, isn't it?
> The problem here is not the flow control. What you need is to avoid
> buffering the messages on the receiver side. Luckily, Open MPI is
> entirely configurable at runtime, so this situation is really easy to
> deal with even at the user level. Set the eager size to zero, and no
> buffering on the receiver side will be made. Your program will survive
> as long as there is some available memory on the receiver.
> On Feb 1, 2008, at 6:32 PM, firstname.lastname@example.org wrote:
> > That would make sense. I was able to break OpenMPI by having Node A wait
> > for messages from Node B. Node B is in fact sleeping while Node C bombards
> > Node A with a few thousand messages. After a while Node B wakes up and
> > sends Node A the message it's been waiting on, but Node A has long since
> > been buried and seg faults. If I decrease the number of messages C is
> > sending, it works properly. This was on OpenMPI 1.2.4, using, I think, the
> > SM BTL (it might have been MX or TCP, but certainly not InfiniBand). I could
> > dig up the test and try again if anyone is seriously curious.
> > Trying the same test on MPICH/MX went very, very slowly (I don't think they
> > have any clever buffer management), but it didn't crash.
> > Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
> > |openmpi-users/Allow| wrote:
> >> Hi,
> >> I am readying an openmpi 1.2.5 software stack for use with a
> >> many-thousand core cluster. I have a question about sending small
> >> messages that I hope can be answered on this list.
> >> I was under the impression that if node A wants to send a small MPI
> >> message to node B, it must have a credit to do so. The credit assures A
> >> that B has enough buffer space to accept the message. Credits are
> >> required by the MPI layer regardless of the BTL transport layer used.
> >> I have been told by a Voltaire tech that this is not so: the credits are
> >> used by the InfiniBand transport layer to reliably send a message, and
> >> are not an Open MPI feature.
> >> Thanks,
> >> Federico
> >> _______________________________________________
> >> users mailing list
> >> email@example.com
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > --
> > --Kris
> > 叶ってしまう夢は本当の夢と言えん。
> > [A dream that comes true can't really be called a dream.]