Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with MPI_BARRIER
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2011-09-08 11:35:34


I should know OMPI better than I do, but generally, when you make an MPI
call, you can end up executing all kinds of other work. In particular,
with non-blocking point-to-point operations, a message can make progress
during an unrelated MPI call. For example:

MPI_Irecv(recv_req)    ! receive is started here...
MPI_Isend(send_req)
MPI_Wait(send_req)     ! ...but much of its data may actually transfer here
MPI_Wait(recv_req)     ! ...and it is only completed here

A receive is started in one call and completed in another, but it's
quite possible that most of the data transfer (and waiting time) occurs
while the program is in the calls associated with the send. The
accounting gets tricky.

So, my guess is that during the second barrier, MPI is busy making
progress on the pending non-blocking point-to-point operations wherever
progress is possible. The time you measure isn't purely the barrier
operation.
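
To make the effect concrete, here is a minimal sketch (illustrative, not
code from this thread; the rank pairing, message size, and all names are
assumptions): post a large non-blocking exchange, then time a barrier
before completing the requests.

program barrier_progress
   use mpi
   implicit none
   integer, parameter :: n = 4*1024*1024       ! large message
   double precision, allocatable :: sbuf(:), rbuf(:)
   integer :: rank, peer, reqs(2), ierr
   double precision :: t0, t_barrier, t_wait

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   allocate(sbuf(n), rbuf(n))
   sbuf = dble(rank)
   peer = ieor(rank, 1)        ! pair ranks 0<->1, 2<->3, ...

   ! Post the exchange but do not complete it yet.
   call MPI_IRECV(rbuf, n, MPI_DOUBLE_PRECISION, peer, 0, &
                  MPI_COMM_WORLD, reqs(1), ierr)
   call MPI_ISEND(sbuf, n, MPI_DOUBLE_PRECISION, peer, 0, &
                  MPI_COMM_WORLD, reqs(2), ierr)

   ! Time a barrier while the exchange is pending: part of the data
   ! transfer may be progressed inside this call.
   t0 = MPI_WTIME()
   call MPI_BARRIER(MPI_COMM_WORLD, ierr)
   t_barrier = MPI_WTIME() - t0

   ! Whatever was progressed in the barrier no longer shows up here.
   t0 = MPI_WTIME()
   call MPI_WAITALL(2, reqs, MPI_STATUSES_IGNORE, ierr)
   t_wait = MPI_WTIME() - t0

   print *, 'rank', rank, 'barrier', t_barrier, 'waitall', t_wait
   call MPI_FINALIZE(ierr)
end program barrier_progress

Run it with an even number of processes: while the exchange is pending,
the barrier time is typically inflated by the progress made inside it,
and the subsequent MPI_WAITALL looks correspondingly cheaper.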

On 9/8/2011 8:04 AM, Ghislain Lartigue wrote:
> This behavior happens at every call (first and following)
>
>
> Here is my code (simplified):
>
> ================================================================
> start_time = MPI_Wtime()
> call mpi_ext_barrier()
> new_time = MPI_Wtime()-start_time
> write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
> call print_message("CAST GHOST DATA2 LOOP 1 barrier "//trim(local_time),0)
>
> do conn_index_id=1, Nconn(conn_type_id)
>
>    ! loop over data
>    this_data => block%data
>    do while (associated(this_data))
>
>       call MPI_IRECV(...)
>       call MPI_ISEND(...)
>
>       this_data => this_data%next
>    enddo
>
> enddo
>
> start_time = MPI_Wtime()
> call mpi_ext_barrier()
> new_time = MPI_Wtime()-start_time
> write(local_time,'(F9.1)') new_time*1.0e9_WP/(36.0_WP*36.0_WP*36.0_WP)
> call print_message("CAST GHOST DATA2 LOOP 2 barrier "//trim(local_time),0)
>
> done = .false.
> counter = 0
> do while (.not.done)
>    do ireq=1,nreq
>       if (recv_req(ireq)/=MPI_REQUEST_NULL) then
>          call MPI_TEST(recv_req(ireq),found,mystatus,icommerr)
>          if (found) then
>             call ....
>             counter = counter+1
>          endif
>       endif
>    enddo
>    if (counter==nreq) then
>       done = .true.
>    endif
> enddo
> ================================================================
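>
> As an aside, a hedged sketch (it reuses the variable names above and
> assumes the per-request handling need not overlap other computation):
> the MPI_TEST polling loop at the end of the listing can be replaced by
> MPI_WAITANY, which blocks until any one request completes:
>
> do counter=1,nreq
>    call MPI_WAITANY(nreq,recv_req,ireq,mystatus,icommerr)
>    call ....   ! same per-request handling as in the MPI_TEST loop
> enddo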
>
> The first call to the barrier works perfectly fine, but the second one gives the strange behavior...
>
> Ghislain.
>
> On 8 Sept 2011, at 16:53, Eugene Loh wrote:
>
>> On 9/8/2011 7:42 AM, Ghislain Lartigue wrote:
>>> I will check that, but as I said in my first email, this strange behaviour happens in only one place in my code.
>> Does the strange behavior appear the first time, or only much later on? (You seem to imply later on, but I thought I'd ask.)
>>
>> I agree the behavior is noteworthy, but it's plausible, and what you've said so far isn't enough information to explain it.
>>
>> Here is one scenario. I don't know if it applies to you, since I know very little about what you're doing. With VampirTrace, you collect performance data into large buffers. Occasionally those buffers need to be flushed to disk, and VampirTrace waits for a good opportunity to do so -- e.g., a global barrier. So you execute lots of barriers, but suddenly you hit one where VT wants to flush to disk. The flush takes a long time, and every process spends that time in the barrier. Then execution resumes and barrier performance looks like it did before.
>>
>> Again, there are various scenarios to explain what you see. More information would be needed to decide which applies to you.