Open MPI Development Mailing List Archives

From: Terry D. Dontje (Terry.Dontje_at_[hidden])
Date: 2007-08-29 11:33:54


Heard you the first time, Gleb; I've just been backed up with other stuff.
The code follows:

  include "mpif.h"

  character(20) cmd_line_arg ! We'll use the first command-line argument
                                 ! to set the duration of the test.

  real(8) :: duration = 10 ! The default duration (in seconds) can be
                                 ! set here.

  real(8) :: endtime ! This is the time at which we'll end the
                                 ! test.

  integer(8) :: nmsgs = 1 ! We'll count the number of messages sent
                                 ! out from each MPI process. There will be
                                 ! at least one message (at the very end),
                                 ! and we'll count all the others.

  logical :: keep_going = .true. ! This flag says whether to keep going.

  ! Initialize MPI stuff.

  call MPI_Init(ier)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
  call MPI_Comm_size(MPI_COMM_WORLD, np, ier)

  if ( np == 1 ) then

    ! Test to make sure there is at least one other process.

    write(6,*) "Need at least 2 processes."
    write(6,*) "Try resubmitting the job with"
    write(6,*) " 'mpirun -np <np>'"
    write(6,*) "where <np> is at least 2."

  else if ( me == 0 ) then

    ! The first command-line argument is the duration of the test (seconds).

    call get_command_argument(1,cmd_line_arg,len,istat)
    if ( istat == 0 ) read(cmd_line_arg,*) duration

    ! Loop until test is done.

    endtime = MPI_Wtime() + duration ! figure out when to end
    do while ( MPI_Wtime() < endtime )
      call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
      nmsgs = nmsgs + 1
    end do

    ! Then, send the closing signal.

    keep_going = .false.
    call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)

    ! Write summary information.

    write(6,'("Target duration (seconds):",f18.6)') duration
    write(6,'("# of messages sent in that time:", i12)') nmsgs
    write(6,'("Microseconds per message:", f19.3)') 1.d6 * duration / nmsgs

  else

    ! If you're not Process 0, you need to receive messages
    ! (and possibly relay them onward).

    do while ( keep_going )

      call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, &
         MPI_STATUS_IGNORE,ier)

      if ( me == np - 1 ) cycle ! The last process only receives messages.

      call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier)

    end do

  end if

  ! Finalize.

  call MPI_Finalize(ier)

end

Sorry it is in Fortran.

--td
Gleb Natapov wrote:

>On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:
>
>
>>If you are going to look at it, I will not bother with this.
>>
>>
>I need the code to reproduce the problem. Otherwise I have nothing to
>look at.
>
>
>
>>Rich
>>
>>
>>On 8/29/07 10:47 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>
>>
>>
>>>On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
>>>
>>>
>>>>Gleb,
>>>> Are you looking at this ?
>>>>
>>>>
>>>Not today. And I need the code to reproduce the bug. Is this possible?
>>>
>>>
>>>
>>>>Rich
>>>>
>>>>
>>>>On 8/29/07 9:56 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>>
>>>>
>>>>
>>>>>On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
>>>>>
>>>>>
>>>>>>Is this trunk or 1.2?
>>>>>>
>>>>>>
>>>>>Oops. I should read more carefully :) This is trunk.
>>>>>
>>>>>
>>>>>
>>>>>>On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
>>>>>>
>>>>>>
>>>>>>>I have a program that does a simple bucket brigade of sends and receives
>>>>>>>where rank 0 is the start and repeatedly sends to rank 1 until a certain
>>>>>>>amount of time has passed and then it sends an all-done packet.
>>>>>>>
>>>>>>>Running this under np=2 always works. However, when I run with more
>>>>>>>than 2 processes using only the SM BTL, the program usually hangs and one
>>>>>>>of the processes has a long stack that repeats the following 3 calls:
>>>>>>>
>>>>>>> [25] opal_progress(), line 187 in "opal_progress.c"
>>>>>>> [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
>>>>>>> [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
>>>>>>>
>>>>>>>When stepping through the ompi_fifo_write_to_head routine it looks like
>>>>>>>the fifo has overflowed.
>>>>>>>
>>>>>>>I am wondering if what is happening is that rank 0 has sent a bunch of
>>>>>>>messages that have exhausted the resources, such that one of the middle
>>>>>>>ranks, which is in the process of sending, cannot send and therefore never
>>>>>>>gets to the point of trying to receive the messages from rank 0?
>>>>>>>
>>>>>>>Is the above a possible scenario or are messages periodically bled off
>>>>>>>the SM BTL's fifos?
>>>>>>>
>>>>>>>Note, I have seen np=3 pass sometimes and I can get it to pass reliably
>>>>>>>if I raise the shared memory space used by the BTL. This is using the
>>>>>>>trunk.
>>>>>>>
>>>>>>>
>>>>>>>--td
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>
>--
> Gleb.
>
>
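The scenario raised in the original report (rank 0 exhausting the shared resources so that a middle rank can neither send nor get back to receiving) can be illustrated with a toy model. This is only a sketch under stated assumptions and is not Open MPI code: it assumes a single shared pool of fragments, a rank 0 that sends as fast as the pool allows, and one action per relay rank per round. The function `run_brigade` and the flag `drain_while_blocked` are hypothetical names; the flag stands for the open question of whether incoming messages get bled off the FIFOs while a rank is blocked in a send.

```python
def run_brigade(np_ranks, pool_size, n_messages, drain_while_blocked):
    """Toy bucket brigade over a shared fragment pool (np_ranks >= 2).

    Returns True if every message reaches the last rank, False on deadlock.
    """
    queued = [0] * np_ranks       # fragments sitting in each rank's shared FIFO
    unexpected = [0] * np_ranks   # messages already copied into private memory
    holding = [False] * np_ranks  # True while a relay rank has a message to forward
    to_send = n_messages          # messages rank 0 still has to inject
    delivered = 0                 # messages received by the last rank
    last = np_ranks - 1

    def free():
        # Fragments left in the (assumed) single shared pool.
        return pool_size - sum(queued)

    def take(rank):
        # Receive one message, preferring ones that were drained earlier.
        if unexpected[rank] > 0:
            unexpected[rank] -= 1
            return True
        if queued[rank] > 0:
            queued[rank] -= 1
            return True
        return False

    while delivered < n_messages:
        progressed = False

        # Rank 0 is greedy: it sends until the pool is empty.
        while to_send > 0 and free() > 0:
            queued[1] += 1
            to_send -= 1
            progressed = True

        for rank in range(1, np_ranks):
            if rank == last:
                # The last rank only receives.
                if take(rank):
                    delivered += 1
                    progressed = True
            elif not holding[rank]:
                # Relay rank in its receive phase.
                if take(rank):
                    holding[rank] = True
                    progressed = True
            else:
                # Relay rank in its send phase; it needs a free fragment.
                if free() == 0 and drain_while_blocked and queued[rank] > 0:
                    # Bleed incoming fragments off the shared FIFO.
                    unexpected[rank] += queued[rank]
                    queued[rank] = 0
                    progressed = True
                if free() > 0:
                    queued[rank + 1] += 1
                    holding[rank] = False
                    progressed = True

        if not progressed:
            return False  # nobody could act in a full round: deadlock
    return True


if __name__ == "__main__":
    for args in [(3, 4, 20, False), (3, 4, 20, True),
                 (2, 4, 20, False), (3, 64, 20, False)]:
        print(args, "completes" if run_brigade(*args) else "deadlocks")
```

In this model, three ranks with a small pool deadlock when the blocked sender does not drain its incoming FIFO and complete when it does; two ranks always complete, and three ranks complete with a pool large enough to hold every message. That is consistent with the reported symptoms, but it does not show that the real hang has this cause.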