Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with MPI_BARRIER
From: Ghislain Lartigue (ghislain.lartigue_at_[hidden])
Date: 2011-09-08 10:42:59


I will check that but, as I said in my first email, this strange behaviour happens only in one place in my code.
I have the same time/barrier/time procedure in other places (in the same code) and it works perfectly.

At one place I have the following output (sorted)
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 14.2
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 16.3
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 25.1
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 28.4
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 32.6
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 35.3
.
.
.
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 90.1
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 96.3
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 99.5
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 101.2
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 119.3
<00>(0) CAST GHOST DATA1 LOOP 1 barrier 169.3

but in the place that concerns me I have (sorted)
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1386.9
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1401.5
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1412.9
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1414.1
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1419.6
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1428.1
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1430.4
.
.
.
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1632.7
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1635.7
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1660.6
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1685.1
<00>(0) CAST GHOST DATA2 LOOP 2 barrier 1699.2

These are in the same units...
You can see that in the first place the time to hit/wait/leave the barrier can be very small compared to the largest value, whereas in the second place it never is...
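
For reference, here is a minimal sketch of this kind of time/barrier/time measurement, extended with an MPI_REDUCE so that the minimum and maximum wait times can be compared directly on rank 0. It is illustrative only; the program structure and variable names are assumptions, not the actual code from the thread.

===================
program barrier_timing
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: start_time, wait_time, min_time, max_time

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! ... the computation/communication under test would go here ...

  start_time = MPI_WTIME()
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  wait_time = MPI_WTIME() - start_time

  ! If the barrier itself is cheap, the minimum across ranks should be
  ! close to zero (the last rank to arrive barely waits).
  call MPI_REDUCE(wait_time, min_time, 1, MPI_DOUBLE_PRECISION, MPI_MIN, &
                  0, MPI_COMM_WORLD, ierr)
  call MPI_REDUCE(wait_time, max_time, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                  0, MPI_COMM_WORLD, ierr)
  if (rank == 0) write(*,*) "barrier wait min/max =", min_time, max_time

  call MPI_FINALIZE(ierr)
end program barrier_timing
===================

With such a reduction, a call site where the barrier behaves as expected should report a minimum close to zero at every invocation.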

On Sep 8, 2011, at 16:35, Teng Ma wrote:

> You'd better check process-core binding in your case. It looks to me like
> P0 and P1 are on the same node and P2 is on another node, which makes the
> ack to P0/P1 go through shared memory and the ack to P2 go through the
> network.
> 1000x is very possible: sm latency can be about 0.03 microsec, while
> ethernet latency is about 20-30 microsec.
>
> Just my guess......
>
> Teng
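
A quick way to test this placement hypothesis is to have every rank report the node it runs on; ranks that print the same name share a node (and thus shared memory), the others talk over the network. This is a sketch, not from the original thread, and assumes hostnames identify nodes; mpirun's --report-bindings option can also show the binding directly.

===================
program where_am_i
  use mpi
  implicit none
  integer :: ierr, rank, namelen
  character(len=MPI_MAX_PROCESSOR_NAME) :: nodename

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  ! Each rank reports where it runs; compare the printed hostnames.
  call MPI_GET_PROCESSOR_NAME(nodename, namelen, ierr)
  write(*,*) "rank", rank, "runs on ", nodename(1:namelen)
  call MPI_FINALIZE(ierr)
end program where_am_i
===================
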
>> Thanks,
>>
>> I understand this, but the delays that I measure are huge compared to a
>> classical ack procedure (1000x more)...
>> And this is repeatable: as far as I understand it, this shows that the
>> network is not involved.
>>
>> Ghislain.
>>
>>
>> On Sep 8, 2011, at 16:16, Teng Ma wrote:
>>
>>> I guess you forgot to count the "leaving time" (fan-out). When everyone
>>> has hit the barrier, each process still needs an "ack" to leave. And
>>> remember that in most cases the leader process sends out the "acks"
>>> sequentially. It's very possible that:
>>>
>>> P0 barrier time = 29 + send/recv ack 0
>>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>
>>> That's your measured time.
>>>
>>> Teng
>>>> This problem has nothing to do with stdout...
>>>>
>>>> Example with 3 processes:
>>>>
>>>> P0 hits barrier at t=12
>>>> P1 hits barrier at t=27
>>>> P2 hits barrier at t=41
>>>>
>>>> In this situation:
>>>> P0 waits 41-12 = 29
>>>> P1 waits 41-27 = 14
>>>> P2 waits 41-41 = 00
>>>>
>>>> So I should see something like (no ordering is expected):
>>>> barrier_time = 14
>>>> barrier_time = 00
>>>> barrier_time = 29
>>>>
>>>> But what I see is much more like
>>>> barrier_time = 22
>>>> barrier_time = 29
>>>> barrier_time = 25
>>>>
>>>> See? No process has a barrier_time equal to zero!!!
>>>>
>>>>
>>>>
>>>> On Sep 8, 2011, at 14:55, Jeff Squyres wrote:
>>>>
>>>>> The order in which you see stdout printed from mpirun is not
>>>>> necessarily reflective of the order in which things were actually
>>>>> printed. Remember that the stdout from each MPI process needs to flow
>>>>> through at least 3 processes and potentially across the network before
>>>>> it is actually displayed on mpirun's stdout.
>>>>>
>>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's
>>>>> stdout
>>>>>
>>>>> Hence, the ordering of stdout can get transposed.
>>>>>
>>>>>
>>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>>
>>>>>> Thank you for this explanation, but indeed it confirms that the LAST
>>>>>> process that hits the barrier should go through nearly instantaneously
>>>>>> (except for the broadcast time of the acknowledgment signal).
>>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>>> very long time before going through the barrier (thousands of times
>>>>>> more than a broadcast)...
>>>>>>
>>>>>>
>>>>>> On Sep 8, 2011, at 14:26, Jeff Squyres wrote:
>>>>>>
>>>>>>> The order in which processes hit the barrier is only one factor in
>>>>>>> the time it takes for each process to finish the barrier.
>>>>>>>
>>>>>>> An easy way to think of a barrier implementation is a "fan in/fan out"
>>>>>>> model. When each nonzero-rank process calls MPI_BARRIER, it sends a
>>>>>>> message saying "I have hit the barrier!" (it usually sends it to its
>>>>>>> parent in a tree of all MPI processes in the communicator, but you can
>>>>>>> simplify this model and consider that it sends it to rank 0). Rank 0
>>>>>>> collects all of these messages. When it has messages from all
>>>>>>> processes in the communicator, it sends out "ok, you can leave the
>>>>>>> barrier now" messages (again, it's usually via a tree distribution,
>>>>>>> but you can pretend that it directly, linearly sends a message to each
>>>>>>> peer process in the communicator).
>>>>>>>
>>>>>>> Hence, the time that any individual process spends in the barrier is
>>>>>>> relative to when every other process enters the barrier. But it's also
>>>>>>> dependent upon communication speed, congestion in the network, etc.
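
To make the fan-in/fan-out picture concrete, here is a minimal linear sketch built from point-to-point calls. It is purely illustrative: Open MPI's real barrier uses tree-based algorithms and is considerably more sophisticated.

===================
subroutine linear_barrier(comm)
  use mpi
  implicit none
  integer, intent(in) :: comm
  integer :: ierr, rank, nprocs, i, token
  integer :: status(MPI_STATUS_SIZE)

  token = 0
  call MPI_COMM_RANK(comm, rank, ierr)
  call MPI_COMM_SIZE(comm, nprocs, ierr)

  if (rank == 0) then
    ! Fan-in: collect "I have hit the barrier" from every other rank.
    do i = 1, nprocs - 1
      call MPI_RECV(token, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, comm, status, ierr)
    end do
    ! Fan-out: release the other ranks, one send after another.
    do i = 1, nprocs - 1
      call MPI_SEND(token, 1, MPI_INTEGER, i, 1, comm, ierr)
    end do
  else
    call MPI_SEND(token, 1, MPI_INTEGER, 0, 0, comm, ierr)
    call MPI_RECV(token, 1, MPI_INTEGER, 0, 1, comm, status, ierr)
  end if
end subroutine linear_barrier
===================

In this model, even the last rank to enter still has to wait for its release message from rank 0, so a near-zero wait-for-others time does not translate into a zero measured barrier time.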
>>>>>>>
>>>>>>>
>>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>>
>>>>>>>> ===================
>>>>>>>> start_time = MPI_Wtime()
>>>>>>>> call MPI_BARRIER(...)
>>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>>> write(*,*) "barrier time =",new_time
>>>>>>>> ==================
>>>>>>>>
>>>>>>>> and then I run my code...
>>>>>>>>
>>>>>>>> I expected that the values of "new_time" would range from 0 to Tmax
>>>>>>>> (1700 in my case).
>>>>>>>> As I understand it, the first process that hits the barrier should
>>>>>>>> print Tmax and the last process that hits the barrier should print 0
>>>>>>>> (or a very low value).
>>>>>>>>
>>>>>>>> But this is not the case: all processes print values in the range
>>>>>>>> 1400-1700!
>>>>>>>>
>>>>>>>> Any explanation?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ghislain.
>>>>>>>>
>>>>>>>> PS:
>>>>>>>> This small code behaves perfectly in other parts of my code...
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> jsquyres_at_[hidden]
>>>>>>> For corporate legal information go to:
>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquyres_at_[hidden]
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> | Teng Ma Univ. of Tennessee |
>>> | tma_at_[hidden] Knoxville, TN |
>>> | http://web.eecs.utk.edu/~tma/ |
>>>
>>>
>>
>>
>>
>
>
> | Teng Ma Univ. of Tennessee |
> | tma_at_[hidden] Knoxville, TN |
> | http://web.eecs.utk.edu/~tma/ |
>
>