Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with MPI_BARRIER
From: Teng Ma (tma_at_[hidden])
Date: 2011-09-08 10:35:04


You should check the process-to-core binding in your case. It looks to me
as if P0 and P1 are on the same node and P2 is on another node, so the acks
to P0/P1 go through shared memory while the ack to P2 goes over the network.
A 1000x difference is quite possible: shared-memory (sm) latency can be
about 0.03 microseconds, while Ethernet latency is about 20-30 microseconds.

Just my guess......
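
To make the sequential-ack picture concrete, here is a minimal Python
sketch of the simplified linear fan-in/fan-out model discussed in this
thread. It is not how Open MPI actually implements MPI_BARRIER. The
function name barrier_times, the ack_costs parameter and the unit-less
cost values are assumptions made up for illustration; the arrival times
are those of the 3-process example quoted below.

```python
def barrier_times(arrivals, ack_costs):
    """Per-rank time spent in a simplified linear fan-in/fan-out barrier.

    arrivals[i]  -- time at which rank i enters the barrier
    ack_costs[i] -- assumed cost for the leader to deliver the ack to rank i
    """
    if len(arrivals) != len(ack_costs):
        raise ValueError("need one ack cost per rank")
    release = max(arrivals)  # fan-in completes when the last rank arrives
    times = []
    for arrival, cost in zip(arrivals, ack_costs):
        release += cost  # the leader sends the acks one after another
        times.append(release - arrival)
    return times


arrivals = [12, 27, 41]  # P0, P1, P2 from the example in this thread

print(barrier_times(arrivals, [0, 0, 0]))     # free acks       -> [29, 14, 0]
print(barrier_times(arrivals, [1, 1, 1]))     # uniform acks    -> [30, 16, 3]
print(barrier_times(arrivals, [1, 1, 1000]))  # one remote rank -> [30, 16, 1002]
```

In this toy model no rank reports zero as soon as the acks have a cost,
and a single slow link dominates the time of the rank behind it. Whether
that matches a given run depends on the real binding and on the barrier
algorithm in use.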

Teng
> Thanks,
>
> I understand this but the delays that I measure are huge compared to a
> classical ack procedure... (1000x more)
> And this is repeatable: as far as I understand it, this shows that the
> network is not involved.
>
> Ghislain.
>
>
> On Sep 8, 2011, at 16:16, Teng Ma wrote:
>
>> I guess you forgot to count the "leaving time" (fan-out). Once everyone
>> has hit the barrier, each process still needs an "ack" before it can
>> leave. And remember that in most cases the leader process sends out the
>> "acks" sequentially. So this is quite possible:
>>
>> P0 barrier time = 29 + send/recv ack 0
>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>
>> That's the time you measure.
>>
>> Teng
>>> This problem has nothing to do with stdout...
>>>
>>> Example with 3 processes:
>>>
>>> P0 hits barrier at t=12
>>> P1 hits barrier at t=27
>>> P2 hits barrier at t=41
>>>
>>> In this situation:
>>> P0 waits 41-12 = 29
>>> P1 waits 41-27 = 14
>>> P2 waits 41-41 = 00
>>
>>
>>
>>> So I should see something like (no ordering is expected):
>>> barrier_time = 14
>>> barrier_time = 00
>>> barrier_time = 29
>>>
>>> But what I see is much more like
>>> barrier_time = 22
>>> barrier_time = 29
>>> barrier_time = 25
>>>
>>> See? No process has a barrier_time equal to zero !!!
>>>
>>>
>>>
>>> On Sep 8, 2011, at 14:55, Jeff Squyres wrote:
>>>
>>>> The order in which you see stdout printed from mpirun is not
>>>> necessarily reflective of the order in which things were actually
>>>> printed. Remember that the stdout from each MPI process needs to flow
>>>> through at least 3 processes and potentially across the network
>>>> before it is actually displayed on mpirun's stdout.
>>>>
>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's
>>>> stdout
>>>>
>>>> Hence, the ordering of stdout can get transposed.
>>>>
>>>>
>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>
>>>>> Thank you for this explanation, but it confirms that the LAST
>>>>> process to hit the barrier should go through nearly instantaneously
>>>>> (except for the broadcast time of the acknowledgment signal).
>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>> very long time before going through the barrier (thousands of times
>>>>> longer than a broadcast)...
>>>>>
>>>>>
>>>>> On Sep 8, 2011, at 14:26, Jeff Squyres wrote:
>>>>>
>>>>>> Order in which processes hit the barrier is only one factor in the
>>>>>> time it takes for that process to finish the barrier.
>>>>>>
>>>>>> An easy way to think of a barrier implementation is a "fan in/fan out"
>>>>>> model. When each nonzero rank process calls MPI_BARRIER, it sends a
>>>>>> message saying "I have hit the barrier!" (it usually sends it to its
>>>>>> parent in a tree of all MPI processes in the communicator, but you can
>>>>>> simplify this model and consider that it sends it to rank 0). Rank 0
>>>>>> collects all of these messages. When it has messages from all
>>>>>> processes in the communicator, it sends out "ok, you can leave the
>>>>>> barrier now" messages (again, it's usually via a tree distribution,
>>>>>> but you can pretend that it directly, linearly sends a message to each
>>>>>> peer process in the communicator).
>>>>>>
>>>>>> Hence, the time that any individual process spends in the barrier
>>>>>> depends on when every other process enters the barrier. But it's
>>>>>> also dependent upon communication speed, congestion in the network,
>>>>>> etc.
>>>>>>
>>>>>>
>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>
>>>>>>> ===================
>>>>>>> start_time = MPI_Wtime()
>>>>>>> call MPI_BARRIER(...)
>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>> write(*,*) "barrier time =",new_time
>>>>>>> ==================
>>>>>>>
>>>>>>> and then I run my code...
>>>>>>>
>>>>>>> I expected that the values of "new_time" would range from 0 to Tmax
>>>>>>> (1700 in my case).
>>>>>>> As I understand it, the first process that hits the barrier should
>>>>>>> print Tmax and the last process that hits the barrier should print 0
>>>>>>> (or a very low value).
>>>>>>>
>>>>>>> But this is not the case: all processes print values in the range
>>>>>>> 1400-1700!
>>>>>>>
>>>>>>> Any explanation?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ghislain.
>>>>>>>
>>>>>>> PS:
>>>>>>> This small code behaves perfectly in other parts of my code...
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> jsquyres_at_[hidden]
>>>>>> For corporate legal information go to:
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/

| Teng Ma Univ. of Tennessee |
| tma_at_[hidden] Knoxville, TN |
| http://web.eecs.utk.edu/~tma/ |