Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] Problem with MPI_BARRIER
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2011-09-08 10:43:44


  I agree sentimentally with Ghislain. The time spent in a barrier
should conceptually be some wait time, which can be very long (possibly
on the order of milliseconds or even seconds), and the time to execute
the barrier operations, which should essentially be "instantaneous" on
some time scale... in any case, very fast (probably on the order of
microseconds). The "laggard" who holds up the barrier operation should
report a very fast time. The other processes might show very much
longer times, depending on how well synchronized they were. At least
one, however, should be "fast".

I agree with others that it'd be nice to know the time units so that we
can judge the speeds.

Anyhow, is this on the first barrier of the program? If one repeats the
timing operation several times in succession, do the barrier times
remain high?

On 09/08/11 10:27, Jai Dayal wrote:
> what tick value are you using (i.e., what units are you using?)
>
> On Thu, Sep 8, 2011 at 10:25 AM, Ghislain Lartigue
> <ghislain.lartigue_at_[hidden] <mailto:ghislain.lartigue_at_[hidden]>> wrote:
>
> Thanks,
>
> I understand this but the delays that I measure are huge compared
> to a classical ack procedure... (1000x more)
> And this is repeatable: as far as I understand it, this shows that
> the network is not involved.
>
> Ghislain.
>
>
> Le 8 sept. 2011 à 16:16, Teng Ma a écrit :
>
> > I guess you forget to count the "leaving time"(fan-out). When
> everyone
> > hits the barrier, it still needs "ack" to leave. And remember
> in most
> > cases, leader process will send out "acks" in a sequence way.
> It's very
> > possible:
> >
> > P0 barrier time = 29 + send/recv ack 0
> > P1 barrier time = 14 + send ack 0 + send/recv ack 1
> > P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
> >
> > That's your measure time.
> >
> > Teng
> >> This problem as nothing to do with stdout...
> >>
> >> Example with 3 processes:
> >>
> >> P0 hits barrier at t=12
> >> P1 hits barrier at t=27
> >> P2 hits barrier at t=41
> >>
> >> In this situation:
> >> P0 waits 41-12 = 29
> >> P1 waits 41-27 = 14
> >> P2 waits 41-41 = 00
> >
> >
> >
> >> So I should see something like (no ordering is expected):
> >> barrier_time = 14
> >> barrier_time = 00
> >> barrier_time = 29
> >>
> >> But what I see is much more like
> >> barrier_time = 22
> >> barrier_time = 29
> >> barrier_time = 25
> >>
> >> See? No process has a barrier_time equal to zero !!!
> >>
> >>
> >>
> >> Le 8 sept. 2011 à 14:55, Jeff Squyres a écrit :
> >>
> >>> The order in which you see stdout printed from mpirun is not
> necessarily
> >>> reflective of what order things were actually printers.
> Remember that
> >>> the stdout from each MPI process needs to flow through at least 3
> >>> processes and potentially across the network before it is actually
> >>> displayed on mpirun's stdout.
> >>>
> >>> MPI process -> local Open MPI daemon -> mpirun -> printed to
> mpirun's
> >>> stdout
> >>>
> >>> Hence, the ordering of stdout can get transposed.
> >>>
> >>>
> >>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
> >>>
> >>>> Thank you for this explanation but indeed this confirms that
> the LAST
> >>>> process that hits the barrier should go through nearly
> instantaneously
> >>>> (except for the broadcast time for the acknowledgment signal).
> >>>> And this is not what happens in my code : EVERY process waits
> for a
> >>>> very long time before going through the barrier (thousands of
> times
> >>>> more than a broadcast)...
> >>>>
> >>>>
> >>>> Le 8 sept. 2011 à 14:26, Jeff Squyres a écrit :
> >>>>
> >>>>> Order in which processes hit the barrier is only one factor
> in the
> >>>>> time it takes for that process to finish the barrier.
> >>>>>
> >>>>> An easy way to think of a barrier implementation is a "fan
> in/fan out"
> >>>>> model. When each nonzero rank process calls MPI_BARRIER, it
> sends a
> >>>>> message saying "I have hit the barrier!" (it usually sends
> it to its
> >>>>> parent in a tree of all MPI processes in the communicator,
> but you can
> >>>>> simplify this model and consider that it sends it to rank
> 0). Rank 0
> >>>>> collects all of these messages. When it has messages from all
> >>>>> processes in the communicator, it sends out "ok, you can
> leave the
> >>>>> barrier now" messages (again, it's usually via a tree
> distribution,
> >>>>> but you can pretend that it directly, linearly sends a
> message to each
> >>>>> peer process in the communicator).
> >>>>>
> >>>>> Hence, the time that any individual process spends in the
> communicator
> >>>>> is relative to when every other process enters the
> communicator. But
> >>>>> it's also dependent upon communication speed, congestion in the
> >>>>> network, etc.
> >>>>>
> >>>>>
> >>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> at a given point in my (Fortran90) program, I write:
> >>>>>>
> >>>>>> ===================
> >>>>>> start_time = MPI_Wtime()
> >>>>>> call MPI_BARRIER(...)
> >>>>>> new_time = MPI_Wtime() - start_time
> >>>>>> write(*,*) "barrier time =",new_time
> >>>>>> ==================
> >>>>>>
> >>>>>> and then I run my code...
> >>>>>>
> >>>>>> I expected that the values of "new_time" would range from 0
> to Tmax
> >>>>>> (1700 in my case)
> >>>>>> As I understand it, the first process that hits the barrier
> should
> >>>>>> print Tmax and the last process that hits the barrier
> should print 0
> >>>>>> (or a very low value).
> >>>>>>
> >>>>>> But this is not the case: all processes print values in the
> range
> >>>>>> 1400-1700!
> >>>>>>
> >>>>>> Any explanation?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ghislain.
> >>>>>>
> >>>>>> PS:
> >>>>>> This small code behaves perfectly in other parts of my code...
>