I agree sentimentally with Ghislain.  The time spent in a barrier should conceptually be some wait time, which can be very long (possibly on the order of milliseconds or even seconds), and the time to execute the barrier operations, which should essentially be "instantaneous" on some time scale... in any case, very fast (probably on the order of microseconds).  The "laggard" who holds up the barrier operation should report a very fast time.  The other processes might show very much longer times, depending on how well synchronized they were.  At least one, however, should be "fast".

I agree with others that it'd be nice to know the time units so that we can judge the speeds.

Anyhow, is this on the first barrier of the program?  If one repeats the timing operation several times in succession, do the barrier times remain high?

On 09/08/11 10:27, Jai Dayal wrote:
what tick value are you using (i.e., what units are you using?)

On Thu, Sep 8, 2011 at 10:25 AM, Ghislain Lartigue <ghislain.lartigue@coria.fr> wrote:
Thanks,

I understand this but the delays that I measure are huge compared to a classical ack procedure... (1000x more)
And this is repeatable: as far as I understand it, this shows that the network is not involved.

Ghislain.


Le 8 sept. 2011 à 16:16, Teng Ma a écrit :

> I guess you forget to count the "leaving time"(fan-out).  When everyone
> hits the barrier, it still needs "ack" to leave.  And remember in most
> cases, leader process will send out "acks" in a sequence way.  It's very
> possible:
>
> P0 barrier time = 29 + send/recv ack 0
> P1 barrier time = 14 + send ack 0  + send/recv ack 1
> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>
> That's your measure time.
>
> Teng
>> This problem as nothing to do with stdout...
>>
>> Example with 3 processes:
>>
>> P0 hits barrier at t=12
>> P1 hits barrier at t=27
>> P2 hits barrier at t=41
>>
>> In this situation:
>> P0 waits 41-12 = 29
>> P1 waits 41-27 = 14
>> P2 waits 41-41 = 00
>
>
>
>> So I should see something  like (no ordering is expected):
>> barrier_time = 14
>> barrier_time = 00
>> barrier_time = 29
>>
>> But what I see is much more like
>> barrier_time = 22
>> barrier_time = 29
>> barrier_time = 25
>>
>> See? No process has a barrier_time equal to zero !!!
>>
>>
>>
>> Le 8 sept. 2011 à 14:55, Jeff Squyres a écrit :
>>
>>> The order in which you see stdout printed from mpirun is not necessarily
>>> reflective of what order things were actually printers.  Remember that
>>> the stdout from each MPI process needs to flow through at least 3
>>> processes and potentially across the network before it is actually
>>> displayed on mpirun's stdout.
>>>
>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's
>>> stdout
>>>
>>> Hence, the ordering of stdout can get transposed.
>>>
>>>
>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>
>>>> Thank you for this explanation but indeed this confirms that the LAST
>>>> process that hits the barrier should go through nearly instantaneously
>>>> (except for the broadcast time for the acknowledgment signal).
>>>> And this is not what happens in my code : EVERY process waits for a
>>>> very long time before going through the barrier (thousands of times
>>>> more than a broadcast)...
>>>>
>>>>
>>>> Le 8 sept. 2011 à 14:26, Jeff Squyres a écrit :
>>>>
>>>>> Order in which processes hit the barrier is only one factor in the
>>>>> time it takes for that process to finish the barrier.
>>>>>
>>>>> An easy way to think of a barrier implementation is a "fan in/fan out"
>>>>> model.  When each nonzero rank process calls MPI_BARRIER, it sends a
>>>>> message saying "I have hit the barrier!" (it usually sends it to its
>>>>> parent in a tree of all MPI processes in the communicator, but you can
>>>>> simplify this model and consider that it sends it to rank 0).  Rank 0
>>>>> collects all of these messages.  When it has messages from all
>>>>> processes in the communicator, it sends out "ok, you can leave the
>>>>> barrier now" messages (again, it's usually via a tree distribution,
>>>>> but you can pretend that it directly, linearly sends a message to each
>>>>> peer process in the communicator).
>>>>>
>>>>> Hence, the time that any individual process spends in the communicator
>>>>> is relative to when every other process enters the communicator.  But
>>>>> it's also dependent upon communication speed, congestion in the
>>>>> network, etc.
>>>>>
>>>>>
>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>
>>>>>> ===================
>>>>>> start_time = MPI_Wtime()
>>>>>> call MPI_BARRIER(...)
>>>>>> new_time = MPI_Wtime() - start_time
>>>>>> write(*,*) "barrier time =",new_time
>>>>>> ==================
>>>>>>
>>>>>> and then I run my code...
>>>>>>
>>>>>> I expected that the values of "new_time" would range from 0 to Tmax
>>>>>> (1700 in my case)
>>>>>> As I understand it, the first process that hits the barrier should
>>>>>> print Tmax and the last process that hits the barrier should print 0
>>>>>> (or a very low value).
>>>>>>
>>>>>> But this is not the case: all processes print values in the range
>>>>>> 1400-1700!
>>>>>>
>>>>>> Any explanation?
>>>>>>
>>>>>> Thanks,
>>>>>> Ghislain.
>>>>>>
>>>>>> PS:
>>>>>> This small code behaves perfectly in other parts of my code...