Open MPI Development Mailing List Archives

From: Terry D. Dontje (Terry.Dontje_at_[hidden])
Date: 2007-08-29 09:40:30


I have a program that does a simple bucket brigade of sends and receives
where rank 0 is the start and repeatedly sends to rank 1 until a certain
amount of time has passed, and then it sends an all-done packet.

Running this under np=2 always works. However, when I run with np greater
than 2 using only the SM BTL, the program usually hangs, and one of the
processes has a long stack containing many repetitions of the following 3 calls:

  [25] opal_progress(), line 187 in "opal_progress.c"
  [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
  [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine it looks like
the fifo has overflowed.

I am wondering whether rank 0 has sent a bunch of messages that have
exhausted the resources, such that one of the middle ranks, which is in
the process of sending, cannot send and therefore never gets to the point
of trying to receive the messages from rank 0?
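
To make the suspected scenario concrete, here is a minimal sketch in plain C
(no MPI) of a toy model of it. Everything in it is an assumption for
illustration and not the SM BTL's actual logic: the function name
simulate_brigade, a single fragment pool shared by all ranks, a rank 0 that
grabs every free fragment as soon as it appears, and middle ranks that block
in their send without draining incoming messages. The real test is time-bounded;
the sketch uses a fixed message count instead.

```c
/* Toy model of the suspected starvation. Hypothetical names throughout;
 * this is not Open MPI code. */
#define MAX_RANKS 64

/* Returns 1 if every message reaches the last rank, 0 if the run stalls
 * (a full round passes in which no rank can act), -1 on bad arguments. */
int simulate_brigade(int nranks, int pool_size, int total_msgs)
{
    int queued[MAX_RANKS] = {0};   /* messages waiting to be received by each rank */
    int holding[MAX_RANKS] = {0};  /* 1 if a middle rank has a message to forward */
    int free_frags = pool_size;    /* shared fragments not currently in flight */
    int sent = 0, delivered = 0;

    if (nranks < 2 || nranks > MAX_RANKS || pool_size < 1 || total_msgs < 0)
        return -1;

    while (delivered < total_msgs) {
        int progress = 0;

        /* Rank 0 (assumed greedy): send for as long as a fragment is free. */
        while (sent < total_msgs && free_frags > 0) {
            free_frags--;
            queued[1]++;
            sent++;
            progress = 1;
        }

        /* Middle ranks: receive one message, then forward it. A rank that is
         * waiting to send does NOT receive, which is the assumption under test. */
        for (int r = 1; r < nranks - 1; r++) {
            if (holding[r]) {
                if (free_frags > 0) {
                    free_frags--;
                    queued[r + 1]++;
                    holding[r] = 0;
                    progress = 1;
                }
            } else if (queued[r] > 0) {
                queued[r]--;
                free_frags++;      /* receiving returns the fragment to the pool */
                holding[r] = 1;
                progress = 1;
            }
        }

        /* Last rank only receives. */
        if (queued[nranks - 1] > 0) {
            queued[nranks - 1]--;
            free_frags++;
            delivered++;
            progress = 1;
        }

        if (!progress)
            return 0;              /* nobody could act: the brigade is stuck */
    }
    return 1;
}
```

Under these assumptions, simulate_brigade(2, 4, 100) returns 1, while
simulate_brigade(3, 4, 100) returns 0: rank 0 takes back the one fragment
that rank 1 freed, so rank 1 can never forward. With a pool at least as
large as the message count, e.g. simulate_brigade(3, 200, 100), the run
completes. That is at least consistent with what I observe, though it says
nothing about whether the real BTL behaves this way.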

Is the above a possible scenario or are messages periodically bled off
the SM BTL's fifos?

Note that I have seen np=3 pass sometimes, and I can get it to pass reliably
if I raise the shared memory space used by the BTL. This is using the
trunk.

--td