Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi credits for eager messages
From: Richard Treumann (treumann_at_[hidden])
Date: 2008-02-04 09:08:45

Is what George says accurate? If so, it sounds to me like OpenMPI does not
comply with the MPI standard on the behavior of the eager protocol. MPICH is
getting dinged in this discussion because they have complied with the
requirements of the MPI standard. IBM MPI also complies with the standard.

If there is any debate about whether the MPI standard does (or should)
require the behavior I describe below then we should move the discussion to
the MPI 2.1 Forum and get a clarification.

To me, the MPI standard is clear that a program like this:

task 0:
start receiving messages

each of tasks 1 to n-1:
loop 5000 times
   MPI_Send(small message to 0)
end loop

may send some small messages eagerly if there is space at task 0, but the
implementation must block each of tasks 1 to n-1 before task 0 runs out of
eager buffer space. Doing this requires a token or credit management system in
which each task holds credits for known buffer space at task 0. Each task
sends eagerly to task 0 until it runs out of credits and then must switch to
the rendezvous protocol. Tasks 1 to n-1 might each do 3 MPI_Sends or 300
MPI_Sends before blocking, depending on how much buffer space there is at
task 0, but they would need to block in some MPI_Send before task 0 overflows
its eager buffer.

When task 0 wakes up and begins receiving the early arrivals, tasks 1 to
n-1 will unblock and resume looping. Allowing the user to shut off the eager
protocol by setting the eager size to 0 does not fix the standards compliance
issue. You must either have no eager protocol at all or have an eager
message token/credit strategy.
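
The credit scheme described above can be sketched as a small single-process
simulation. This is only an illustration under stated assumptions, not how any
MPI library implements it: the names Receiver, Sender, try_send and run are
hypothetical, the eager slots are split evenly among the senders, a credit is
handed back directly when task 0 consumes a message (a real implementation
would return credits in messages of its own), and a sender that is out of
credits simply waits instead of switching to rendezvous.

```python
from collections import deque


class Receiver:
    """Task 0: owns a fixed number of eager buffer slots (hypothetical model)."""

    def __init__(self, eager_slots, num_senders):
        self.buffer = deque()  # early arrivals not yet received
        self.eager_slots = eager_slots
        # Assumption: slots are split evenly; each sender starts with its share.
        self.initial_credits = eager_slots // num_senders

    def deliver_eager(self, sender_id, msg):
        # With correct credit accounting this check can never fail.
        assert len(self.buffer) < self.eager_slots, "eager buffer overflow"
        self.buffer.append((sender_id, msg))

    def receive_one(self, senders):
        """Consume one buffered message and return its credit to the sender."""
        sender_id, msg = self.buffer.popleft()
        senders[sender_id].credits += 1
        return msg


class Sender:
    """One of tasks 1 to n-1, looping over small sends to task 0."""

    def __init__(self, sender_id, receiver, total_messages):
        self.sender_id = sender_id
        self.receiver = receiver
        self.credits = receiver.initial_credits
        self.remaining = total_messages
        self.sent = 0

    def try_send(self):
        """Send eagerly if a credit is available; False means 'would block'."""
        if self.remaining == 0 or self.credits == 0:
            return False
        self.credits -= 1
        self.remaining -= 1
        self.receiver.deliver_eager(self.sender_id, self.sent)
        self.sent += 1
        return True


def run(num_senders=4, eager_slots=12, messages_per_sender=5000):
    receiver = Receiver(eager_slots, num_senders)
    senders = {i: Sender(i, receiver, messages_per_sender)
               for i in range(1, num_senders + 1)}

    # Phase 1: task 0 is not receiving yet; each sender runs until it blocks.
    for s in senders.values():
        while s.try_send():
            pass
    sent_before_block = {i: s.sent for i, s in senders.items()}
    peak = len(receiver.buffer)

    # Phase 2: task 0 wakes up; every receive frees a credit, so the
    # corresponding sender unblocks and resumes looping.
    received = 0
    while receiver.buffer:
        receiver.receive_one(senders)
        received += 1
        for s in senders.values():
            while s.try_send():
                pass
        peak = max(peak, len(receiver.buffer))
    return sent_before_block, peak, received
```

With the default arguments, each of the 4 senders completes 3 sends before
blocking, the buffer at task 0 never holds more than its 12 slots, and all
20000 messages are eventually received. Without the credit check in try_send,
the same loop would trip the overflow assertion in phase 1.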


Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-bounces_at_[hidden] wrote on 02/03/2008 06:59:38 PM:

> Well ... this is exactly the kind of behavior a high performance
> application tries to achieve, isn't it?
> The problem here is not the flow control. What you need is to avoid
> buffering the messages on the receiver side. Luckily, Open MPI is
> entirely configurable at runtime, so this situation is really easy to
> deal with even at the user level. Set the eager size to zero, and no
> buffering will be done on the receiver side. Your program will survive
> as long as there is some available memory on the receiver.
> Thanks,
> George.
> On Feb 1, 2008, at 6:32 PM, 8mj6tc902_at_[hidden] wrote:
> > That would make sense. I was able to break OpenMPI by having Node A wait
> > for messages from Node B. Node B is in fact sleeping while Node C bombards
> > Node A with a few thousand messages. After a while Node B wakes up and
> > sends Node A the message it's been waiting on, but Node A has long since
> > been buried and seg faults. If I decrease the number of messages C is
> > sending, it works properly. This was on OpenMPI 1.2.4 (using, I think, the
> > SM BTL; it might have been MX or TCP, but certainly not InfiniBand. I could
> > dig up the test and try again if anyone is seriously curious).
> >
> > Trying the same test on MPICH/MX went very, very slowly (I don't think
> > they have any clever buffer management), but it didn't crash.
> >
> > Sacerdoti, Federico wrote:
> >> Hi,
> >>
> >> I am readying an openmpi 1.2.5 software stack for use with a
> >> many-thousand core cluster. I have a question about sending small
> >> messages that I hope can be answered on this list.
> >>
> >> I was under the impression that if node A wants to send a small MPI
> >> message to node B, it must have a credit to do so. The credit assures A
> >> that B has enough buffer space to accept the message. Credits are
> >> required by the MPI layer regardless of the BTL transport layer used.
> >>
> >> I have been told by a Voltaire tech that this is not so: the credits are
> >> used by the InfiniBand transport layer to reliably send a message, and
> >> are not an Open MPI feature.
> >>
> >> Thanks,
> >> Federico
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >>
> >
> >
> > --
> > --Kris
> >
> > 叶ってしまう夢は本当の夢と言えん。
> > [A dream that comes true can't really be called a dream.]