Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi credits for eager messages
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-02-04 16:41:21


Please allow me to slightly modify your example. It still follows the
rules of the MPI standard, so I think it is a 100% standard-compliant
parallel application.

+----------------------------------------------------------------+
| task 0:                                                        |
+----------------------------------------------------------------+
| MPI_Init()                                                     |
| sleep(3000)                                                    |
| for( msg = 0; msg < 5000; msg++ ) {                            |
|     for( peer = 1; peer < com_size; peer++ ) {                 |
|         MPI_Recv( ..., from = peer, tag = (4999 - msg), ... ); |
|     }                                                          |
| }                                                              |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
| task 1 to com_size - 1:                                        |
+----------------------------------------------------------------+
| MPI_Init()                                                     |
| for( msg = 0; msg < 5000; msg++ ) {                            |
|     MPI_Send( ..., 0, tag = msg, ... );                        |
| }                                                              |
+----------------------------------------------------------------+

Won't flow control prevent this application from running to
completion? It's easy to write an application that breaks a particular
MPI implementation. That doesn't necessarily make the implementation
non-standard-compliant.

george.

On Feb 4, 2008, at 9:08 AM, Richard Treumann wrote:

> Is what George says accurate? If so, it sounds to me like OpenMPI
> does not comply with the MPI standard on the behavior of eager
> protocol. MPICH is getting dinged in this discussion because they
> have complied with the requirements of the MPI standard. IBM MPI
> also complies with the standard.
>
> If there is any debate about whether the MPI standard does (or
> should) require the behavior I describe below then we should move
> the discussion to the MPI 2.1 Forum and get a clarification.
>
> To me, the MPI standard is clear that a program like this:
>
> task 0:
> MPI_Init
> sleep(3000);
> start receiving messages
>
> each of tasks 1 to n-1:
> MPI_Init
> loop 5000 times
> MPI_Send(small message to 0)
> end loop
>
> May send some small messages eagerly if there is space at task 0 but
> must block each task 1 to n-1 before allowing task 0 to run out of
> eager buffer space. Doing this requires a token or credit management
> system in which each task has credits for known buffer space at task
> 0. Each task will send eagerly to task 0 until the sender runs out
> of credits and then must switch to rendezvous protocol. Tasks 1 to
> n-1 might each do 3 MPI_Sends or 300 MPI_Sends before blocking,
> depending on how much buffer space there is at task 0 but they would
> need to block in some MPI_Send before task 0 blows up.
>
> When task 0 wakes up and begins receiving the early arrivals, tasks
> 1 to n-1 will unblock and resume looping. Allowing the user to shut
> off eager protocol by setting eager size to 0 does not fix the
> standards compliance issue. You must either have no eager protocol
> at all or must have an eager message token/credit strategy.
>
> Dick
>
> Dick Treumann - MPI Team/TCEM
> IBM Systems & Technology Group
> Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363
>
>
> users-bounces_at_[hidden] wrote on 02/03/2008 06:59:38 PM:
>
> > Well ... this is exactly the kind of behavior a high-performance
> > application tries to achieve, isn't it?
> >
> > The problem here is not the flow control. What you need is to avoid
> > buffering the messages on the receiver side. Luckily, Open MPI is
> > entirely configurable at runtime, so this situation is really easy to
> > deal with even at the user level. Set the eager size to zero, and no
> > buffering will be done on the receiver side. Your program will survive
> > as long as there is some available memory on the receiver.
> >
> > Thanks,
> > George.
> >
> > On Feb 1, 2008, at 6:32 PM, 8mj6tc902_at_[hidden] wrote:
> >
> > > That would make sense. I was able to break OpenMPI by having Node A
> > > wait for messages from Node B. Node B is in fact sleeping while Node C
> > > bombards Node A with a few thousand messages. After a while Node B
> > > wakes up and sends Node A the message it's been waiting on, but Node A
> > > has long since been buried and seg faults. If I decrease the number of
> > > messages C is sending, it works properly. This was on OpenMPI 1.2.4
> > > (using, I think, the SM BTL; it might have been MX or TCP, but
> > > certainly not InfiniBand. I could dig up the test and try again if
> > > anyone is seriously curious).
> > >
> > > Trying the same test on MPICH/MX went very, very slowly (I don't
> > > think they have any clever buffer management), but it didn't crash.
> > >
> > > Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com
> > > |openmpi-users/Allow| wrote:
> > >> Hi,
> > >>
> > >> I am readying an openmpi 1.2.5 software stack for use with a
> > >> many-thousand core cluster. I have a question about sending small
> > >> messages that I hope can be answered on this list.
> > >>
> > >> I was under the impression that if node A wants to send a small MPI
> > >> message to node B, it must have a credit to do so. The credit
> > >> assures A that B has enough buffer space to accept the message.
> > >> Credits are required by the MPI layer regardless of the BTL
> > >> transport layer used.
> > >>
> > >> I have been told by a Voltaire tech that this is not so: the credits
> > >> are used by the InfiniBand transport layer to reliably send a
> > >> message, and are not an openmpi feature.
> > >>
> > >> Thanks,
> > >> Federico
> > >>
> > >> _______________________________________________
> > >> users mailing list
> > >> users_at_[hidden]
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> > > --
> > > --Kris
> > >
> > > 叶ってしまう夢は本当の夢と言えん。
> > > [A dream that comes true can't really be called a dream.]
> >
> > [attachment "smime.p7s" deleted by Richard
> > Treumann/Poughkeepsie/IBM]
