Open MPI Development Mailing List Archives

From: Terry D. Dontje (Terry.Dontje_at_[hidden])
Date: 2007-08-29 09:40:30

I have a program that does a simple bucket brigade of sends and receives:
rank 0 starts the chain and repeatedly sends to rank 1 until a certain
amount of time has passed, and then it sends an all-done packet.

Running this with np=2 always works. However, when I run with more than
2 processes using only the SM BTL, the program usually hangs, and one of
the processes has a long stack that repeats the following 3 calls many times:

  [25] opal_progress(), line 187 in "opal_progress.c"
  [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
  [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine it looks like
the fifo has overflowed.

I am wondering if what is happening is that rank 0 has sent a bunch of
messages that have exhausted the resources, such that one of the middle
ranks, which is in the process of sending, cannot send and therefore
never gets to the point of trying to receive the messages from rank 0?

Is the above a possible scenario or are messages periodically bled off
the SM BTL's fifos?
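
To make the scenario I am asking about concrete, here is a toy model in C.
It is only a sketch of the hypothesis, not Open MPI code and not a claim
about how the SM BTL actually behaves. Everything in it is made up for
illustration: the function simulate(), the single shared fragment pool of
size pool_size, the per-rank inbox and pending counters, and the drain flag,
which stands in for "messages are bled off the fifo while a send is stuck".
Ranks take turns in a fixed order (0, 1, ..., last) on every pass, which is
also an assumption.

```c
#include <assert.h>

#define MAX_RANKS 16

/* Toy model (hypothetical, not Open MPI code).
 * Rank 0 sends nmsgs messages down a chain of nranks ranks. Every message
 * in a fifo occupies one fragment from a shared pool of pool_size.
 * Returns how many messages reach the last rank before the chain either
 * finishes or stops making progress. */
int simulate(int nranks, int pool_size, int nmsgs, int drain)
{
    int inbox[MAX_RANKS] = {0};    /* fragments queued for each rank      */
    int pending[MAX_RANKS] = {0};  /* received but not yet forwarded      */
    int free_frags = pool_size;    /* fragments left in the shared pool   */
    int to_send = nmsgs;           /* messages rank 0 still wants to send */
    int delivered = 0;             /* messages received by the last rank  */
    int last = nranks - 1;

    assert(nranks >= 2 && nranks <= MAX_RANKS && pool_size > 0 && nmsgs >= 0);

    while (delivered < nmsgs) {
        int progressed = 0;

        /* Rank 0: send whenever a fragment is available. */
        if (to_send > 0 && free_frags > 0) {
            free_frags--; inbox[1]++; to_send--;
            progressed = 1;
        }

        /* Middle ranks: forward a held message if a fragment is free.
         * Otherwise receive, but only when holding nothing or when the
         * drain policy lets a stuck sender empty its inbox. */
        for (int r = 1; r < last; r++) {
            if (pending[r] > 0 && free_frags > 0) {
                free_frags--; inbox[r + 1]++; pending[r]--;
                progressed = 1;
            } else if (inbox[r] > 0 && (pending[r] == 0 || drain)) {
                inbox[r]--; free_frags++; pending[r]++;
                progressed = 1;
            }
        }

        /* Last rank: receive, which returns the fragment to the pool. */
        if (inbox[last] > 0) {
            inbox[last]--; free_frags++; delivered++;
            progressed = 1;
        }

        if (!progressed)
            break;  /* nobody could move: the chain is stuck */
    }
    return delivered;
}
```

In this model simulate(2, 2, 10, 0) delivers all 10 messages, while
simulate(3, 2, 10, 0) stalls early: rank 0 refills the pool's fragments into
rank 1's inbox while rank 1 sits in its send. Giving the model a larger pool
(simulate(3, 16, 10, 0)) or letting the stuck sender drain
(simulate(3, 2, 10, 1)) lets it complete.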

Note, I have seen np=3 pass sometimes and I can get it to pass reliably
if I raise the shared memory space used by the BTL. This is using the