Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM init failures
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-30 17:48:25


Tim Mattox wrote:

>I think I remember setting up the MTT tests on Sif so that tests
>are run both with and without the coll_hierarch component selected.
>The coll_hierarch component stresses code paths and potential
>race conditions in its own way. So, if the problems are showing up
>more frequently for the test runs with the coll_hierarch component
>enabled, then I would check the communicator creation code paths.
>
>
Going back to the subject heading "SM init failures", I looked at a
bunch of the MTT stack traces, e.g., the 143 failures with 20880 on
IU_Sif seen at http://www.open-mpi.org/mtt/index.php?do_redir=973 . If
you look at the failures where "MPI_Init" shows up in the stack trace,
you get one of these two:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x2aaab16a6080
[ 0] /lib64/libpthread.so.0
[ 1] .../install/lib/openmpi/mca_btl_sm.so
[ 2] .../install/lib/openmpi/mca_pml_ob1.so
[ 3] .../install/lib/openmpi/mca_pml_ob1.so
[ 4] .../install/lib/openmpi/mca_coll_tuned.so
[ 5] .../install/lib/openmpi/mca_coll_tuned.so
[ 6] .../install/lib/libmpi.so.0(ompi_comm_split+0xc4)
[ 7] .../install/lib/openmpi/mca_coll_hierarch.so
[ 8] .../install/lib/libmpi.so.0
[ 9] .../install/lib/libmpi.so.0
[10] .../install/lib/libmpi.so.0(MPI_Init+0xf0)

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Invalid permissions (2)
Failing at address: 0x2aaab02d6080
[ 0] /lib64/libpthread.so.0
[ 1] .../install/lib/openmpi/mca_btl_sm.so
[ 2] .../install/lib/libopen-pal.so.0(opal_progress+0x5a)
[ 3] .../install/lib/openmpi/mca_pml_ob1.so
[ 4] .../install/lib/openmpi/mca_coll_hierarch.so
[ 5] .../install/lib/openmpi/mca_coll_hierarch.so
[ 6] .../install/lib/openmpi/mca_coll_hierarch.so
[ 7] .../install/lib/libmpi.so.0
[ 8] .../install/lib/libmpi.so.0
[ 9] .../install/lib/libmpi.so.0(ompi_comm_activate+0xd1)
[10] .../install/lib/libmpi.so.0(ompi_comm_split+0x37b)
[11] .../install/lib/openmpi/mca_coll_hierarch.so
[12] .../install/lib/libmpi.so.0
[13] .../install/lib/libmpi.so.0
[14] .../install/lib/libmpi.so.0(MPI_Init+0x17b)

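Both traces fault in mca_btl_sm.so, but both get there from MPI_Init by
way of mca_coll_hierarch.so calling ompi_comm_split. So, this seems to
me to be related to mca_coll_hierarch.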
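
For anyone who wants to sort through a larger batch of these traces, here
is a rough sketch of how the check above could be automated. It is not
part of MTT; the function names (parse_frames, summarize) are made up for
illustration, and it assumes the frame lines look like the ones quoted
above, with frame 0 (libpthread) being the signal-delivery frame so that
frame 1 is where the fault occurred.

```python
import re

# Matches frame lines such as
#   "[ 6] .../install/lib/libmpi.so.0(ompi_comm_split+0xc4)"
# The "(symbol+0xoffset)" suffix is optional.
FRAME_RE = re.compile(r"^\[\s*(\d+)\]\s+(\S+?)(?:\((\w+)\+0x[0-9a-fA-F]+\))?$")


def parse_frames(trace_text):
    """Return (index, library basename, symbol or None) per frame line."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            index, path, symbol = match.groups()
            frames.append((int(index), path.rsplit("/", 1)[-1], symbol))
    return frames


def summarize(trace_text):
    """Report which components of interest appear in one stack trace."""
    frames = parse_frames(trace_text)
    libraries = [lib for _, lib, _ in frames]
    symbols = {sym for _, _, sym in frames if sym}
    return {
        "in_mpi_init": "MPI_Init" in symbols,
        "via_comm_split": "ompi_comm_split" in symbols,
        "via_coll_hierarch": any(
            lib.startswith("mca_coll_hierarch") for lib in libraries
        ),
        # Assumption: frame 0 is the signal frame, frame 1 is the faulting code.
        "faulting_library": libraries[1] if len(libraries) > 1 else None,
    }
```

Running summarize() over each trace and counting how many have both
in_mpi_init and via_coll_hierarch set would show how many of the
MPI_Init failures go through the hierarch component.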