Open MPI Development Mailing List Archives

Subject: [OMPI devel] SM initialization race condition
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2008-08-21 07:22:09


I've been seeing an intermittent segv (roughly once every 4 hours when
looping on a quick initialization program) with the following stack trace.

=>[1] mca_btl_sm_add_procs(btl = 0xfffffd7ffdb67ef0, nprocs = 2U, procs
= 0x591560, peers = 0x591580, reachability = 0xfffffd7fffdff000), line
519 in "btl_sm.c"
  [2] mca_bml_r2_add_procs(nprocs = 2U, procs = 0x591560, bml_endpoints
= 0x591500, reachable = 0xfffffd7fffdff000), line 222 in "bml_r2.c"
  [3] mca_pml_ob1_add_procs(procs = 0x5914c0, nprocs = 2U), line 248 in
"pml_ob1.c"
  [4] ompi_mpi_init(argc = 1, argv = 0xfffffd7fffdff318, requested = 0,
provided = 0xfffffd7fffdff234), line 651 in "ompi_mpi_init.c"
  [5] PMPI_Init(argc = 0xfffffd7fffdff2ec, argv = 0xfffffd7fffdff2e0),
line 90 in "pinit.c"
  [6] main(argc = 1, argv = 0xfffffd7fffdff318), line 82 in "buffer.c"

I believe the problem is that mca_btl_sm_component.shm_fifo[j] contains
uninitialized data, which causes the loop on line 504 in btl_sm.c to think
that a remote rank has already set its FIFO address.
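
For illustration, here is a minimal, self-contained C sketch of the pattern
I suspect (this is not the actual btl_sm.c code; shared_seg_t,
peer_fifo_ready_racy(), and the slot values are all made up). A rank polls a
per-peer slot in the shared-memory segment for that peer's FIFO address, and
if the segment is not zero-filled before ranks start polling, stale bytes in
the slot look like a published address:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define NPEERS 2

typedef struct {
    uintptr_t fifo_addr[NPEERS];  /* each peer publishes its FIFO address in its own slot */
} shared_seg_t;

/* Racy readiness check: treats any non-zero value as "peer j has published
 * its FIFO address".  If the segment was never zero-filled, stale garbage
 * in the slot passes this test. */
static int peer_fifo_ready_racy(const shared_seg_t *seg, int j)
{
    return seg->fifo_addr[j] != 0;
}

int main(void)
{
    shared_seg_t seg;

    /* Simulate an uninitialized shared-memory mapping full of stale bytes. */
    memset(&seg, 0xAB, sizeof(seg));

    /* No peer has stored an address yet, but the check still "succeeds". */
    printf("racy check, uninitialized segment: %d (false positive)\n",
           peer_fifo_ready_racy(&seg, 1));

    /* Fix sketch: the creator zero-fills the segment before anyone polls it,
     * so a non-zero slot can only mean a peer really wrote its address. */
    memset(&seg, 0, sizeof(seg));
    printf("racy check, zero-filled segment:   %d\n",
           peer_fifo_ready_racy(&seg, 1));
    return 0;
}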

Has anyone else seen the above happening?

--td