Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi credits for eager messages
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-02-04 18:04:22


Richard,

You're absolutely right. What a shame :) If I had spent less time
drawing the boxes around the code, I might have noticed the typo. The
Send should be an Isend.
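
With that fix, each sender task's loop in my example below would look
something like this (the int payload and the fixed-size request array
are illustrative):

    int payload = 0;
    MPI_Request reqs[5000];
    for (int msg = 0; msg < 5000; msg++)
        MPI_Isend(&payload, 1, MPI_INT, 0, msg, MPI_COMM_WORLD,
                  &reqs[msg]);
    MPI_Waitall(5000, reqs, MPI_STATUSES_IGNORE);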

   george.

On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:

> Hi George
>
> Sorry - this is not a valid MPI program. It violates the requirement
> that a program not depend on there being any system buffering. See
> pages 32-33 of MPI 1.1.
>
> Let's simplify to:
> Task 0:
> MPI_Recv( from 1 with tag 1)
> MPI_Recv( from 1 with tag 0)
>
> Task 1:
> MPI_Send(to 0 with tag 0)
> MPI_Send(to 0 with tag 1)
>
> Without any early arrival buffer (or with the eager size set to 0),
> task 0 will hang in the first MPI_Recv and never post a recv with
> tag 0. Task 1 will hang in the MPI_Send with tag 0 because it cannot
> get past it until the matching recv is posted by task 0.
>
> If there is enough early arrival buffer for the first MPI_Send on
> task 1 to complete and the second MPI_Send to be posted, the example
> will run. Once both sends are posted by task 1, task 0 will harvest
> the second send and get out of its first recv. Task 0's second recv
> can now pick up the message from the early arrival buffer, where it
> had to go to let task 1 complete send 1 and post send 2.
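>
> A minimal self-contained version of this pattern (two ranks and an
> int payload, both illustrative) might look like:
>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, buf = 0;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         if (rank == 0) {
>             /* recvs posted in the reverse of the send order */
>             MPI_Recv(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>         } else if (rank == 1) {
>             MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
>             MPI_Send(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
>         }
>         MPI_Finalize();
>         return 0;
>     }
>
> With no early-arrival buffering, rank 0 blocks in the tag-1 recv
> while rank 1 blocks in the tag-0 send, and neither can progress.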
>
> If an application wants to do this kind of order inversion, it
> should use non-blocking operations. For example, if task 0 posted an
> MPI_Irecv for tag 1, then an MPI_Recv for tag 0, and lastly an
> MPI_Wait for the Irecv, the example would be valid.
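>
> In code, the valid ordering on task 0 would be something like this
> (buf0 and buf1 are assumed int buffers, as in the sketch above):
>
>     MPI_Request req;
>     MPI_Irecv(&buf1, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
>     MPI_Recv(&buf0, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
>              MPI_STATUS_IGNORE);
>     MPI_Wait(&req, MPI_STATUS_IGNORE);
>
> The recv for tag 1 is posted before task 0 blocks on tag 0, so the
> program no longer depends on system buffering.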
>
> I am not aware of any case where the standard allows a correct MPI
> program to be deadlocked by an implementation limit. It can be
> failed if it exceeds a limit, but I do not think it is ever OK to
> hang.
>
> Dick
>
> Dick Treumann - MPI Team/TCEM
> IBM Systems & Technology Group
> Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363
>
>
> users-bounces_at_[hidden] wrote on 02/04/2008 04:41:21 PM:
>
> > Please allow me to slightly modify your example. It still follows
> > the rules of the MPI standard, so I think it's a 100%
> > standard-compliant parallel application.
> >
> > +------------------------------------------------------------+
> > | task 0:                                                    |
> > +------------------------------------------------------------+
> > | MPI_Init()                                                 |
> > | sleep(3000)                                                |
> > | for( msg = 0; msg < 5000; msg++ ) {                        |
> > |   for( peer = 0; peer < com_size; peer++ ) {               |
> > |     MPI_Recv( ..., from = peer, tag = (5000 - msg),... );  |
> > |   }                                                        |
> > | }                                                          |
> > +------------------------------------------------------------+
> >
> > +------------------------------------------------------------+
> > | task 1 to com_size:                                        |
> > +------------------------------------------------------------+
> > | MPI_Init()                                                 |
> > | for( msg = 0; msg < 5000; msg++ ) {                        |
> > |   MPI_Send( ..., 0, tag = msg, ... );                      |
> > | }                                                          |
> > +------------------------------------------------------------+
> >
> > Won't the flow control stop this application from running to
> > completion? It's easy to write an application that breaks a
> > particular MPI implementation; that doesn't necessarily make the
> > implementation non-compliant with the standard.
> >
> > george.
> >
> > On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:
> >
> > > Is what George says accurate? If so, it sounds to me like OpenMPI
> > > does not comply with the MPI standard on the behavior of eager
> > > protocol. MPICH is getting dinged in this discussion because they
> > > have complied with the requirements of the MPI standard. IBM MPI
> > > also complies with the standard.
> > >
> > > If there is any debate about whether the MPI standard does (or
> > > should) require the behavior I describe below then we should move
> > > the discussion to the MPI 2.1 Forum and get a clarification.
> > >
> > > To me, the MPI standard is clear that a program like this:
> > >
> > > task 0:
> > > MPI_Init
> > > sleep(3000);
> > > start receiving messages
> > >
> > > each of tasks 1 to n-1:
> > > MPI_Init
> > > loop 5000 times
> > > MPI_Send(small message to 0)
> > > end loop
> > >
> > > May send some small messages eagerly if there is space at task
> > > 0, but must block each of tasks 1 to n-1 rather than allow task
> > > 0 to run out of eager buffer space. Doing this requires a token
> > > or credit management system in which each task has credits for
> > > known buffer space at task 0. Each task will send eagerly to
> > > task 0 until the sender runs out of credits and then must switch
> > > to rendezvous protocol. Tasks 1 to n-1 might each do 3 MPI_Sends
> > > or 300 MPI_Sends before blocking, depending on how much buffer
> > > space there is at task 0, but they would need to block in some
> > > MPI_Send before task 0 blows up.
> > >
> > > When task 0 wakes up and begins receiving the early arrivals,
> > > tasks 1 to n-1 will unblock and resume looping. Allowing the
> > > user to shut off eager protocol by setting the eager size to 0
> > > does not fix the standards-compliance issue. You must either
> > > have no eager protocol at all or have an eager-message
> > > token/credit strategy.
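> > >
> > > In outline, the sender side of such a credit scheme might look
> > > like this (all names are illustrative, not any particular MPI
> > > implementation's internals):
> > >
> > >     typedef struct { int credits; } peer_t;
> > >
> > >     void send_small(peer_t *dest, const void *buf, int len)
> > >     {
> > >         if (dest->credits > 0) {
> > >             /* a known early-arrival slot is free at dest */
> > >             dest->credits--;
> > >             eager_send(dest, buf, len);  /* completes locally */
> > >         } else {
> > >             /* out of credits: wait for the matching recv */
> > >             rendezvous_send(dest, buf, len);
> > >         }
> > >         /* dest->credits++ when the receiver acks that an
> > >            early-arrival slot has been drained */
> > >     }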
> > >
> > > Dick
> > >
> > > Dick Treumann - MPI Team/TCEM
> > > IBM Systems & Technology Group
> > > Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > > Tele (845) 433-7846 Fax (845) 433-8363
> > >
> > >
> > > users-bounces_at_[hidden] wrote on 02/03/2008 06:59:38 PM:
> > >
> > > > Well ... this is exactly the kind of behavior a
> > > > high-performance application tries to achieve, isn't it?
> > > >
> > > > The problem here is not the flow control. What you need is to
> > > > avoid buffering the messages on the receiver side. Luckily,
> > > > Open MPI is entirely configurable at runtime, so this
> > > > situation is really easy to deal with, even at the user level.
> > > > Set the eager size to zero, and no buffering will be done on
> > > > the receiver side. Your program will survive as long as there
> > > > is some available memory on the receiver.
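> > > >
> > > > For example, with the TCP BTL something along these lines
> > > > should work (my_app and the process count are placeholders,
> > > > and the exact parameter names vary by BTL and release; check
> > > > ompi_info for your build):
> > > >
> > > >     mpirun --mca btl_tcp_eager_limit 0 -np 16 ./my_app
> > > >     ompi_info --param btl tcp | grep eager_limit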
> > > >
> > > > Thanks,
> > > > George.
> > > >
> > > > On Feb 1, 2008, at 6:32 PM, 8mj6tc902_at_[hidden] wrote:
> > > >
> > > > > That would make sense. I was able to break OpenMPI by having
> > > > > Node A wait for messages from Node B. Node B is in fact
> > > > > sleeping while Node C bombards Node A with a few thousand
> > > > > messages. After a while Node B wakes up and sends Node A the
> > > > > message it's been waiting on, but Node A has long since been
> > > > > buried and seg faults. If I decrease the number of messages
> > > > > C is sending, it works properly. This was on OpenMPI 1.2.4
> > > > > (using, I think, the SM BTL; it might have been MX or TCP,
> > > > > but certainly not infiniband. I could dig up the test and
> > > > > try again if anyone is seriously curious).
> > > > >
> > > > > Trying the same test on MPICH/MX went very, very slowly (I
> > > > > don't think they have any clever buffer management) but it
> > > > > didn't crash.
> > > > >
> > > > > Sacerdoti, Federico (Federico.Sacerdoti-at-deshaw.com) wrote:
> > > > >> Hi,
> > > > >>
> > > > >> I am readying an openmpi 1.2.5 software stack for use with
> > > > >> a many-thousand core cluster. I have a question about
> > > > >> sending small messages that I hope can be answered on this
> > > > >> list.
> > > > >>
> > > > >> I was under the impression that if node A wants to send a
> > > > >> small MPI message to node B, it must have a credit to do
> > > > >> so. The credit assures A that B has enough buffer space to
> > > > >> accept the message. Credits are required by the MPI layer
> > > > >> regardless of the BTL transport layer used.
> > > > >>
> > > > >> I have been told by a Voltaire tech that this is not so:
> > > > >> the credits are used by the infiniband transport layer to
> > > > >> reliably send a message, and are not an openmpi feature.
> > > > >>
> > > > >> Thanks,
> > > > >> Federico
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > --Kris
> > > > >
> > > > > 叶ってしまう夢は本当の夢と言えん。
> > > > > [A dream that comes true can't really be called a dream.]


