So with an Isend your program becomes valid MPI and a very nice illustration of why the MPI standard cannot limit envelopes (or send/recv descriptors) and why at some point the number of descriptors can blow the limits. It also illustrates how the management of eager messages remains workable. (Workable is not the same as affordable or appropriate; I agree it has serious scaling issues.) Let's assume there is managed early arrival space for 10 messages per sender.

Each MPI_Isend generates an envelope that goes to the destination. For your program to unwind properly, every envelope must be delivered to the destination. The first (blocking) MPI_Recv is looking for the tag in the last envelope, so if libmpi does not deliver all 5000 envelopes per sender, the first MPI_Recv will block forever. It is not acceptable for a valid MPI program to deadlock. If the destination cannot hold all the envelopes there is no choice but to fail the job. The standard allows this. The Forum considered it better to fail a job than to deadlock it.
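
To make the envelope argument concrete, here is a small stand-alone C model. It is only a sketch: it is not MPI code and does not describe how any particular libmpi is built. The function name peak_unmatched_envelopes and the max_envelopes limit are made up for illustration. It assumes one sender whose envelopes for tags 1..n_msgs arrive in the order sent, and a receiver posting blocking receives for tags n_msgs down to 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model, not an MPI call. Returns the largest number of
 * unmatched envelopes the receiver holds at any one time, or -1 if
 * max_envelopes is too small (the job must be failed rather than hang)
 * or if the bookkeeping array cannot be allocated. */
int peak_unmatched_envelopes(int n_msgs, int max_envelopes)
{
    assert(n_msgs > 0 && max_envelopes >= 0);
    /* queued[tag] != 0 means the envelope for that tag is waiting. */
    char *queued = calloc((size_t)n_msgs + 1, 1);
    if (queued == NULL)
        return -1;

    int next_to_arrive = 1; /* envelopes arrive in the order they were sent */
    int held = 0, peak = 0;

    for (int wanted = n_msgs; wanted >= 1; wanted--) {
        /* The posted receive stays blocked until its envelope has arrived. */
        while (!queued[wanted]) {
            if (held == max_envelopes) { /* nowhere to put the next envelope */
                free(queued);
                return -1;
            }
            queued[next_to_arrive++] = 1;
            held++;
            if (held > peak)
                peak = held;
        }
        queued[wanted] = 0; /* match: the envelope leaves the queue */
        held--;
    }
    free(queued);
    return peak;
}
```

Under those assumptions, peak_unmatched_envelopes(5000, 5000) reports a peak of 5000, and any smaller limit returns -1: the first receive cannot match until every envelope from that sender has been accepted.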

If each sender sends its first 10 messages eagerly, the send side tokens will be used up and the buffer space at the destination will fill up but not overflow. The senders now fall back to rendezvous for their remaining 4990 MPI_Isends. The MPI_Isends cannot block. They send envelopes as fast as the loop can run, but the user send buffers involved cannot be altered until the waits occur. Once the last sent envelope with tag 5000 arrives and matches the posted MPI_Recv, an OK_to_send goes back to the sender and the data can be moved from the still intact send buffer to the waiting receive buffer. The MPI_Waits for the Isend requests can be done in any order, but no send buffer can be changed until the corresponding MPI_Wait returns. No system buffer is needed for message data.

The MPI_Recvs being posted in reverse order (5000, 4999, ..., 11) each ship an OK_to_send, and data flows directly from send to recv buffers. Finally the MPI_Recvs for tags (10, ..., 1) get posted and pull their message data from the early arrival space. The program has unwound correctly, and as the early arrival space frees up, credits can be returned to the sender.
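
The whole exchange for one sender can be traced with a short C model. Again this is a sketch under stated assumptions, not MPI code: run_exchange, struct unwind_result and the per-sender credits count are hypothetical names, one credit stands for one early arrival slot, and a credit is returned as soon as its slot is drained.

```c
#include <assert.h>

/* Hypothetical summary of one sender's exchange with task 0. */
struct unwind_result {
    int peak_early_arrival; /* most early arrival slots in use at once */
    int direct_transfers;   /* rendezvous: send buffer straight to recv buffer */
    int from_early_arrival; /* eager: copied out of early arrival space */
    int credits_at_end;     /* sender credits after every receive completes */
};

struct unwind_result run_exchange(int n_msgs, int credits)
{
    assert(n_msgs >= 0 && credits >= 0);
    struct unwind_result r = {0, 0, 0, 0};
    int slots_in_use = 0;
    int eager_limit = 0; /* tags 1..eager_limit were sent eagerly */

    /* Sender: the Isend loop, tags 1..n_msgs. No iteration blocks. */
    for (int tag = 1; tag <= n_msgs; tag++) {
        if (credits > 0) {
            credits--;          /* envelope and data go together */
            slots_in_use++;
            eager_limit = tag;
        }
        /* else: rendezvous, only the envelope travels for now */
    }
    r.peak_early_arrival = slots_in_use;

    /* Receiver: blocking receives posted for tags n_msgs down to 1. */
    for (int tag = n_msgs; tag >= 1; tag--) {
        if (tag > eager_limit) {
            r.direct_transfers++;   /* OK_to_send, then user-to-user copy */
        } else {
            r.from_early_arrival++; /* data was already at the destination */
            slots_in_use--;
            credits++;              /* slot freed, credit goes back */
        }
    }
    r.credits_at_end = credits;
    return r;
}
```

With 5000 messages and 10 credits, the model never uses more than 10 early arrival slots, moves 4990 messages directly between user buffers, and ends with all 10 credits back at the sender.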

Performance discussions aside - the semantic is clean and reliable.

Thanks - Dick

PS - If anyone responds to this I hope you will state clearly whether you want to talk about:

- What does the standard require?
or
- What should the standard require?

Dick Treumann - MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-bounces@open-mpi.org wrote on 02/04/2008 06:04:22 PM:

> Richard,
>
> You're absolutely right. What a shame :) If I had spent less time
> drawing the boxes around the code I might have noticed the typo. The
> Send should be an Isend.
>
>    george.
>
> On Feb 4, 2008, at 5:32 PM, Richard Treumann wrote:
>
> > Hi George
> >
> > Sorry - This is not a valid MPI program. It violates the requirement
> > that a program not depend on there being any system buffering. See
> > pages 32-33 of MPI 1.1.
> >
> > Let's simplify to:
> > Task 0:
> > MPI_Recv( from 1 with tag 1)
> > MPI_Recv( from 1 with tag 0)
> >
> > Task 1:
> > MPI_Send(to 0 with tag 0)
> > MPI_Send(to 0 with tag 1)
> >
> > Without any early arrival buffer (or with eager size set to 0) task  
> > 0 will hang in the first MPI_Recv and never post a recv with tag 0.  
> > Task 1 will hang in the MPI_Send with tag 0 because it cannot get  
> > past it until the matching recv is posted by task 0.
> >
> > If there is enough early arrival buffer for the first MPI_Send on
> > task 1 to complete and the second MPI_Send to be posted, the example
> > will run. Once both sends are posted by task 1, task 0 will harvest
> > the second send and get out of its first recv. Task 0's second recv
> > can now pick up the message from the early arrival buffer where it
> > had to go to let task 1 complete send 1 and post send 2.
> >
> > If an application wants to do this kind of order inversion it should
> > use some nonblocking operations. For example, if task 0 posted an
> > MPI_Irecv for tag 1, an MPI_Recv for tag 0 and lastly, an MPI_Wait
> > for the Irecv, the example is valid.
> >
> > I am not aware of any case where the standard allows a correct MPI  
> > program to be deadlocked by an implementation limit. It can be  
> > failed if it exceeds a limit but I do not think it is ever OK to hang.
> >
> > Dick
> >
> >
> > users-bounces@open-mpi.org wrote on 02/04/2008 04:41:21 PM:
> >
> > > Please allow me to slightly modify your example. It still follows the
> > > rules from the MPI standard, so I think it's a 100% standard-compliant
> > > parallel application.
> > >
> > > +------------------------------------------------------------+
> > > |                         task 0:                            |
> > > +------------------------------------------------------------+
> > > | MPI_Init()                                                 |
> > > | sleep(3000)                                                |
> > > | for( msg = 0; msg < 5000; msg++ ) {                        |
> > > |   for( peer = 0; peer < com_size; peer++ ) {               |
> > > |     MPI_Recv( ..., from = peer, tag = (5000 - msg),... );  |
> > > |   }                                                        |
> > > | }                                                          |
> > > +------------------------------------------------------------+
> > >
> > > +------------------------------------------------------------+
> > > |                   task 1 to com_size:                      |
> > > +------------------------------------------------------------+
> > > | MPI_Init()                                                 |
> > > | for( msg = 0; msg < 5000; msg++ ) {                        |
> > > |   MPI_Send( ..., 0, tag = msg, ... );                      |
> > > | }                                                          |
> > > +------------------------------------------------------------+
> > >
> > > Won't the flow control stop the application from running to
> > > completion? It's easy to write an application that breaks a particular
> > > MPI implementation. That doesn't necessarily make the implementation
> > > non-compliant with the standard.
> > >
> > > george.
> > >
> > > On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:
> > >
> > > > Is what George says accurate? If so, it sounds to me like OpenMPI
> > > > does not comply with the MPI standard on the behavior of eager
> > > > protocol. MPICH is getting dinged in this discussion because they
> > > > have complied with the requirements of the MPI standard. IBM MPI
> > > > also complies with the standard.
> > > >
> > > > If there is any debate about whether the MPI standard does (or
> > > > should) require the behavior I describe below then we should move
> > > > the discussion to the MPI 2.1 Forum and get a clarification.
> > > >
> > > > To me, the MPI standard is clear that a program like this:
> > > >
> > > > task 0:
> > > > MPI_Init
> > > > sleep(3000);
> > > > start receiving messages
> > > >
> > > > each of tasks 1 to n-1:
> > > > MPI_Init
> > > > loop 5000 times
> > > > MPI_Send(small message to 0)
> > > > end loop
> > > >
> > > > may send some small messages eagerly if there is space at task 0 but
> > > > must block each task 1 to n-1 before allowing task 0 to run out of
> > > > eager buffer space. Doing this requires a token or credit management
> > > > system in which each task has credits for known buffer space at task
> > > > 0. Each task will send eagerly to task 0 until the sender runs out
> > > > of credits and then must switch to rendezvous protocol. Tasks 1 to
> > > > n-1 might each do 3 MPI_Sends or 300 MPI_Sends before blocking,
> > > > depending on how much buffer space there is at task 0, but they would
> > > > need to block in some MPI_Send before task 0 blows up.
> > > >
> > > > When task 0 wakes up and begins receiving the early arrivals, tasks
> > > > 1 to n-1 will unblock and resume looping. Allowing the user to shut
> > > > off eager protocol by setting eager size to 0 does not fix the
> > > > standards compliance issue. You must either have no eager protocol
> > > > at all or must have an eager message token/credit strategy.
> > > >
> > > > Dick
> > > >
> > > >
> > > > users-bounces@open-mpi.org wrote on 02/03/2008 06:59:38 PM:
> > > >
> > > > > Well ... this is exactly the kind of behavior a high performance
> > > > > application tries to achieve, isn't it?
> > > > >
> > > > > The problem here is not the flow control. What you need is to avoid
> > > > > buffering the messages on the receiver side. Luckily, Open MPI is
> > > > > entirely configurable at runtime, so this situation is really easy to
> > > > > deal with even at the user level. Set the eager size to zero, and no
> > > > > buffering on the receiver side will be made. Your program will survive
> > > > > as long as there is some available memory on the receiver.
> > > > >
> > > > >    Thanks,
> > > > >      George.
> > > > >
> > > > > On Feb 1, 2008, at 6:32 PM, 8mj6tc902@sneakemail.com wrote:
> > > > >
> > > > > > That would make sense. I was able to break OpenMPI by having Node A
> > > > > > wait for messages from Node B. Node B is in fact sleeping while Node C
> > > > > > bombards Node A with a few thousand messages. After a while Node B
> > > > > > wakes up and sends Node A the message it's been waiting on, but Node A
> > > > > > has long since been buried and seg faults. If I decrease the number of
> > > > > > messages C is sending, it works properly. This was on OpenMPI 1.2.4
> > > > > > (using, I think, the SM BTL; it might have been MX or TCP, but
> > > > > > certainly not infiniband. I could dig up the test and try again if
> > > > > > anyone is seriously curious).
> > > > > >
> > > > > > Trying the same test on MPICH/MX went very, very slowly (I don't think
> > > > > > they have any clever buffer management) but it didn't crash.
> > > > > >
> > > > > > Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
> > > > > > |openmpi-users/Allow| wrote:
> > > > > >> Hi,
> > > > > >>
> > > > > > >> I am readying an openmpi 1.2.5 software stack for use with a
> > > > > > >> many-thousand core cluster. I have a question about sending small
> > > > > > >> messages that I hope can be answered on this list.
> > > > > > >>
> > > > > > >> I was under the impression that if node A wants to send a small MPI
> > > > > > >> message to node B, it must have a credit to do so. The credit
> > > > > > >> assures A that B has enough buffer space to accept the message.
> > > > > > >> Credits are required by the mpi layer regardless of the BTL
> > > > > > >> transport layer used.
> > > > > > >>
> > > > > > >> I have been told by a Voltaire tech that this is not so: the
> > > > > > >> credits are used by the infiniband transport layer to reliably send
> > > > > > >> a message, and are not an openmpi feature.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Federico
> > > > > >>
> > > > > >> _______________________________________________
> > > > > >> users mailing list
> > > > > >> users@open-mpi.org
> > > > > >>
http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > > >
> > > > > >
> > > > > > --
> > > > > > --Kris
> > > > > >
> > > > > > 叶ってしまう夢は本当の夢と言えん。
> > > > > > [A dream that comes true can't really be called a dream.]