
Open MPI Development Mailing List Archives


From: Terry D. Dontje (Terry.Dontje_at_[hidden])
Date: 2007-08-29 11:36:12


To run the code I usually do "mpirun -np 6 a.out 10" on a 2-core
system. It'll print out the following and then hang:
Target duration (seconds): 10.000000
# of messages sent in that time: 589207
Microseconds per message: 16.972
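
To force everything over just the shared-memory BTL (the case in the
original report below), the usual BTL selection flag should do it,
something like:

  mpirun --mca btl self,sm -np 6 a.out 10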

--td

Terry D. Dontje wrote:

> Heard you the first time, Gleb; just been backed up with other stuff.
> Following is the code:
>
> include "mpif.h"
>
> character(20) cmd_line_arg     ! We'll use the first command-line argument
>                                ! to set the duration of the test.
>
> real(8) :: duration = 10       ! The default duration (in seconds) can be
>                                ! set here.
>
> real(8) :: endtime             ! This is the time at which we'll end the
>                                ! test.
>
> integer(8) :: nmsgs = 1        ! We'll count the number of messages sent
>                                ! out from each MPI process. There will be
>                                ! at least one message (at the very end),
>                                ! and we'll count all the others.
>
> logical :: keep_going = .true. ! This flag says whether to keep going.
>
> ! Initialize MPI stuff.
>
> call MPI_Init(ier)
> call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
> call MPI_Comm_size(MPI_COMM_WORLD, np, ier)
>
> if ( np == 1 ) then
>
> ! Test to make sure there is at least one other process.
>
> write(6,*) "Need at least 2 processes."
> write(6,*) "Try resubmitting the job with"
> write(6,*) " 'mpirun -np <np>'"
> write(6,*) "where <np> is at least 2."
>
> else if ( me == 0 ) then
>
> ! The first command-line argument is the duration of the test (seconds).
>
> call get_command_argument(1,cmd_line_arg,len,istat)
> if ( istat == 0 ) read(cmd_line_arg,*) duration
>
> ! Loop until test is done.
>
> endtime = MPI_Wtime() + duration ! figure out when to end
> do while ( MPI_Wtime() < endtime )
> call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
> nmsgs = nmsgs + 1
> end do
>
> ! Then, send the closing signal.
>
> keep_going = .false.
> call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
>
> ! Write summary information.
>
> write(6,'("Target duration (seconds):",f18.6)') duration
> write(6,'("# of messages sent in that time:", i12)') nmsgs
> write(6,'("Microseconds per message:", f19.3)') 1.d6 * duration /
> nmsgs
>
> else
>
> ! If you're not Process 0, you need to receive messages
> ! (and possibly relay them onward).
>
> do while ( keep_going )
>
> call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, &
> MPI_STATUS_IGNORE,ier)
>
> if ( me == np - 1 ) cycle ! The last process only receives messages.
>
> call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier)
>
> end do
>
> end if
>
> ! Finalize.
>
> call MPI_Finalize(ier)
>
> end
>
> Sorry it is in Fortran.
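>
> In case C is handier to drop into a test harness, here is a rough,
> untested C sketch of the same bucket brigade (just a hand translation of
> the Fortran above, not the code I actually run):
>
> /* Rough, untested C translation of the Fortran bucket brigade above. */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     double duration = 10.0;   /* default test length in seconds */
>     double endtime;           /* time at which the test ends */
>     long   nmsgs = 1;         /* the final "stop" message is counted too */
>     int    keep_going = 1;
>     int    me, np;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>
>     if (np == 1) {
>         printf("Need at least 2 processes.\n");
>     } else if (me == 0) {
>         /* The first command-line argument is the duration in seconds. */
>         if (argc > 1) duration = atof(argv[1]);
>
>         /* Send to rank 1 as fast as possible until time runs out. */
>         endtime = MPI_Wtime() + duration;
>         while (MPI_Wtime() < endtime) {
>             MPI_Send(&keep_going, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
>             nmsgs++;
>         }
>
>         /* Then send the closing signal and report. */
>         keep_going = 0;
>         MPI_Send(&keep_going, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
>
>         printf("Target duration (seconds): %f\n", duration);
>         printf("# of messages sent in that time: %ld\n", nmsgs);
>         printf("Microseconds per message: %.3f\n",
>                1.0e6 * duration / (double) nmsgs);
>     } else {
>         /* Everyone else receives and relays; the last rank only receives. */
>         while (keep_going) {
>             MPI_Recv(&keep_going, 1, MPI_INT, me - 1, 1, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             if (me == np - 1) continue;
>             MPI_Send(&keep_going, 1, MPI_INT, me + 1, 1, MPI_COMM_WORLD);
>         }
>     }
>
>     MPI_Finalize();
>     return 0;
> }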
>
> --td
> Gleb Natapov wrote:
>
>> On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:
>>
>>
>>> If you are going to look at it, I will not bother with this.
>>>
>>
>> I need the code to reproduce the problem. Otherwise I have nothing to
>> look at.
>>
>>
>>> Rich
>>>
>>>
>>> On 8/29/07 10:47 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>
>>>
>>>
>>>> On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
>>>>
>>>>
>>>>> Gleb,
>>>>> Are you looking at this ?
>>>>>
>>>>
>>>> Not today. And I need the code to reproduce the bug. Is this possible?
>>>>
>>>>
>>>>
>>>>> Rich
>>>>>
>>>>>
>>>>> On 8/29/07 9:56 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
>>>>>>
>>>>>>
>>>>>>> Is this trunk or 1.2?
>>>>>>>
>>>>>>
>>>>>> Oops. I should read more carefully :) This is trunk.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
>>>>>>>
>>>>>>>
>>>>>>>> I have a program that does a simple bucket brigade of sends and
>>>>>>>> receives, where rank 0 is the start and repeatedly sends to rank 1
>>>>>>>> until a certain amount of time has passed, and then it sends an
>>>>>>>> all-done packet.
>>>>>>>>
>>>>>>>> Running this under np=2 always works. However, when I run with more
>>>>>>>> than 2 processes using only the SM btl, the program usually hangs and
>>>>>>>> one of the processes has a long stack containing many repetitions of
>>>>>>>> the following 3 calls:
>>>>>>>>
>>>>>>>> [25] opal_progress(), line 187 in "opal_progress.c"
>>>>>>>> [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
>>>>>>>> [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
>>>>>>>>
>>>>>>>> When stepping through the ompi_fifo_write_to_head routine it looks
>>>>>>>> like the fifo has overflowed.
>>>>>>>>
>>>>>>>> I am wondering if what is happening is that rank 0 has sent a bunch
>>>>>>>> of messages that have exhausted the resources, such that one of the
>>>>>>>> middle ranks, which is in the process of sending, cannot send and
>>>>>>>> therefore never gets to the point of trying to receive the messages
>>>>>>>> from rank 0.
>>>>>>>>
>>>>>>>> Is the above a possible scenario, or are messages periodically bled
>>>>>>>> off the SM BTL's fifos?
>>>>>>>>
>>>>>>>> Note, I have seen np=3 pass sometimes, and I can get it to pass
>>>>>>>> reliably if I raise the shared memory space used by the BTL. This is
>>>>>>>> using the trunk.
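>>>>>>>> (For what it's worth, I bump that space via the sm mpool MCA
>>>>>>>> parameters; something like "ompi_info --param mpool sm" should list
>>>>>>>> the exact parameter names on the trunk.)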
>>>>>>>>
>>>>>>>>
>>>>>>>> --td
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>
>> --
>> Gleb.
>>
>
>