Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with MPI_BARRIER
From: Teng Ma (tma_at_[hidden])
Date: 2011-09-08 11:01:05


Do barrier/time/barrier/time, i.e. time a second MPI_BARRIER placed
immediately after the first one, and run your code again.
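
Something like this, a minimal sketch of that pattern (MPI_COMM_WORLD is
assumed; adapt to your communicator). The second barrier starts with all
processes already synchronized, so its time should be close to the cost of
the barrier operation itself:

===================
program barrier_timing
  use mpi
  implicit none
  integer :: ierr
  double precision :: start_time, new_time

  call MPI_INIT(ierr)

  ! first barrier: the measured time includes the arrival skew
  start_time = MPI_Wtime()
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  new_time = MPI_Wtime() - start_time
  write(*,*) "barrier 1 time =", new_time

  ! second barrier: everyone is already synchronized when it starts,
  ! so this should measure only the barrier operation itself
  start_time = MPI_Wtime()
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  new_time = MPI_Wtime() - start_time
  write(*,*) "barrier 2 time =", new_time

  call MPI_FINALIZE(ierr)
end program barrier_timing
==================

If the second time is still huge at the problematic spot, the delay is not
just arrival skew.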

Teng
> I will check that but, as I said in my first email, this strange behaviour
> happens in only one place in my code.
> I have the same time/barrier/time procedure in other places in the same
> code, and it works perfectly.
>
> In one place I have the following output (sorted):
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 14.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 16.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 25.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 28.4
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 32.6
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 35.3
> .
> .
> .
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 90.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 96.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 99.5
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 101.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 119.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 169.3
>
> but in the place that concerns me I have (sorted):
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1386.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1401.5
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1412.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1414.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1419.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1428.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1430.4
> .
> .
> .
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1632.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1635.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1660.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1685.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1699.2
>
>
> The units are the same in both outputs...
> You can see that in the first place, the time to "hit/wait/leave" can be
> very small, whereas in the second output it never is...
>
>
> On Sep 8, 2011, at 4:35 PM, Teng Ma wrote:
>
>> You'd better check the process-core binding in your case. It looks to me
>> like P0 and P1 are on the same node and P2 is on another node, which makes
>> the ack to P0/P1 go through shared memory and the ack to P2 go through the
>> network. A 1000x difference is very possible: shared-memory latency can be
>> about 0.03 microseconds, while Ethernet latency is about 20-30 microseconds.
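>>
>> To check, you can run with Open MPI's mpirun binding report, e.g.
>> (the option spelling varies across versions, and "your_code" is just a
>> placeholder):
>>
>>   mpirun --report-bindings -np 3 ./your_code
>>
>> which prints where each rank is bound, so you can see whether P2 really
>> lands on another node.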
>>
>> Just my guess......
>>
>> Teng
>>> Thanks,
>>>
>>> I understand this, but the delays that I measure are huge compared to a
>>> classical ack procedure (1000x more)...
>>> And this is repeatable: as far as I understand it, this shows that the
>>> network is not involved.
>>>
>>> Ghislain.
>>>
>>>
>>> On Sep 8, 2011, at 4:16 PM, Teng Ma wrote:
>>>
>>>> I guess you forgot to count the "leaving time" (fan-out). When everyone
>>>> has hit the barrier, each process still needs an "ack" to leave. And
>>>> remember that in most cases the leader process sends out the "acks"
>>>> sequentially. It's very possible that:
>>>>
>>>> P0 barrier time = 29 + send/recv ack 0
>>>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>>
>>>> That's your measured time.
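>>>>
>>>> In other words, if one ack costs roughly t, this model predicts measured
>>>> times of about 29 + t, 14 + 2t, and 0 + 3t: once t is large, no process
>>>> reports a time near zero.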
>>>>
>>>> Teng
>>>>> This problem has nothing to do with stdout...
>>>>>
>>>>> Example with 3 processes:
>>>>>
>>>>> P0 hits barrier at t=12
>>>>> P1 hits barrier at t=27
>>>>> P2 hits barrier at t=41
>>>>>
>>>>> In this situation:
>>>>> P0 waits 41-12 = 29
>>>>> P1 waits 41-27 = 14
>>>>> P2 waits 41-41 = 00
>>>>>
>>>>> So I should see something like (no ordering is expected):
>>>>> barrier_time = 14
>>>>> barrier_time = 00
>>>>> barrier_time = 29
>>>>>
>>>>> But what I see is much more like
>>>>> barrier_time = 22
>>>>> barrier_time = 29
>>>>> barrier_time = 25
>>>>>
>>>>> See? No process has a barrier_time equal to zero!!!
>>>>>
>>>>>
>>>>>
>>>>> On Sep 8, 2011, at 2:55 PM, Jeff Squyres wrote:
>>>>>
>>>>>> The order in which you see stdout printed from mpirun is not
>>>>>> necessarily reflective of the order in which things were actually
>>>>>> printed. Remember that the stdout from each MPI process needs to flow
>>>>>> through at least 3 processes, and potentially across the network,
>>>>>> before it is actually displayed on mpirun's stdout.
>>>>>>
>>>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's stdout
>>>>>>
>>>>>> Hence, the ordering of stdout can get transposed.
>>>>>>
>>>>>>
>>>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>>>
>>>>>>> Thank you for this explanation, but indeed this confirms that the
>>>>>>> LAST process that hits the barrier should go through nearly
>>>>>>> instantaneously (except for the broadcast time of the acknowledgment
>>>>>>> signal).
>>>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>>>> very long time before going through the barrier (thousands of times
>>>>>>> longer than a broadcast)...
>>>>>>>
>>>>>>>
>>>>>>> On Sep 8, 2011, at 2:26 PM, Jeff Squyres wrote:
>>>>>>>
>>>>>>>> The order in which processes hit the barrier is only one factor in
>>>>>>>> the time it takes each process to finish the barrier.
>>>>>>>>
>>>>>>>> An easy way to think of a barrier implementation is a "fan in/fan
>>>>>>>> out" model. When each nonzero rank process calls MPI_BARRIER, it
>>>>>>>> sends a message saying "I have hit the barrier!" (it usually sends
>>>>>>>> it to its parent in a tree of all MPI processes in the communicator,
>>>>>>>> but you can simplify this model and consider that it sends it to
>>>>>>>> rank 0). Rank 0 collects all of these messages. When it has messages
>>>>>>>> from all processes in the communicator, it sends out "ok, you can
>>>>>>>> leave the barrier now" messages (again, it's usually via a tree
>>>>>>>> distribution, but you can pretend that it directly, linearly sends a
>>>>>>>> message to each peer process in the communicator).
>>>>>>>>
>>>>>>>> Hence, the time that any individual process spends in the barrier
>>>>>>>> depends on when every other process enters the barrier. But it's
>>>>>>>> also dependent upon communication speed, congestion in the network,
>>>>>>>> etc.
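>>>>>>>>
>>>>>>>> In this simplified model, the barrier boils down to something like
>>>>>>>> the following sketch (a linear fan in/fan out for illustration only,
>>>>>>>> not Open MPI's actual tree-based code; assumes an MPI program with
>>>>>>>> "use mpi"):
>>>>>>>>
>>>>>>>> ===================
>>>>>>>> integer :: i, rank, nprocs, ierr, buf(1)
>>>>>>>> integer :: status(MPI_STATUS_SIZE)
>>>>>>>> integer, parameter :: tag = 0
>>>>>>>>
>>>>>>>> call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>>>>>>>> call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
>>>>>>>>
>>>>>>>> if (rank == 0) then
>>>>>>>>    ! fan in: collect an "I hit the barrier!" message from every peer
>>>>>>>>    do i = 1, nprocs - 1
>>>>>>>>       call MPI_RECV(buf, 0, MPI_INTEGER, MPI_ANY_SOURCE, tag, &
>>>>>>>>                     MPI_COMM_WORLD, status, ierr)
>>>>>>>>    end do
>>>>>>>>    ! fan out: release the peers one by one, in a linear sequence
>>>>>>>>    do i = 1, nprocs - 1
>>>>>>>>       call MPI_SEND(buf, 0, MPI_INTEGER, i, tag, MPI_COMM_WORLD, ierr)
>>>>>>>>    end do
>>>>>>>> else
>>>>>>>>    ! everyone else: announce arrival, then wait to be released
>>>>>>>>    call MPI_SEND(buf, 0, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, ierr)
>>>>>>>>    call MPI_RECV(buf, 0, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, &
>>>>>>>>                  status, ierr)
>>>>>>>> end if
>>>>>>>> ==================
>>>>>>>>
>>>>>>>> Even the last process to arrive still waits for its release message,
>>>>>>>> so nobody's measured time is exactly zero.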
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 8, 2011, at 6:20 AM, Ghislain Lartigue wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> at a given point in my (Fortran90) program, I write:
>>>>>>>>>
>>>>>>>>> ===================
>>>>>>>>> start_time = MPI_Wtime()
>>>>>>>>> call MPI_BARRIER(...)
>>>>>>>>> new_time = MPI_Wtime() - start_time
>>>>>>>>> write(*,*) "barrier time =",new_time
>>>>>>>>> ==================
>>>>>>>>>
>>>>>>>>> and then I run my code...
>>>>>>>>>
>>>>>>>>> I expected that the values of "new_time" would range from 0 to
>>>>>>>>> Tmax (1700 in my case).
>>>>>>>>> As I understand it, the first process that hits the barrier should
>>>>>>>>> print Tmax and the last process that hits the barrier should print
>>>>>>>>> 0 (or a very low value).
>>>>>>>>>
>>>>>>>>> But this is not the case: all processes print values in the range
>>>>>>>>> 1400-1700!
>>>>>>>>>
>>>>>>>>> Any explanation?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ghislain.
>>>>>>>>>
>>>>>>>>> PS:
>>>>>>>>> This timing snippet behaves perfectly in other parts of my code...

| Teng Ma Univ. of Tennessee |
| tma_at_[hidden] Knoxville, TN |
| http://web.eecs.utk.edu/~tma/ |