Subject: [OMPI devel] some info is not pushed into the dstore
From: Gilles Gouaillardet (gilles.gouaillardet_at_[hidden])
Date: 2014-05-27 06:51:00


Folks,

While debugging dynamic/intercomm_create from the ibm test suite, I
found something odd.

I ran *without* any batch manager on a VM (one socket, four cpus):
mpirun -np 1 ./dynamic/intercomm_create

By default it hangs;
it works with --mca coll ^ml.

Basically:
- task 0 spawns task 1
- task 0 spawns task 2
- an inter-communicator involving all three tasks is created via
MPI_Intercomm_create() (a rough sketch of this pattern is shown below)
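
For reference, here is a minimal, self-contained sketch of that pattern.
This is my own paraphrase, *not* the actual ibm test source; the "a"/"b"
arguments and the tag value are made up:

/* rough sketch, NOT the actual ibm test source: one binary, task 0 spawns
 * two singleton children ("a" and "b") and bridges them with
 * MPI_Intercomm_create() */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter_a, inter_b, intra, inter_ab;
    char *args_a[] = { "a", NULL };
    char *args_b[] = { "b", NULL };

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* task 0 spawns task 1 ("a") and task 2 ("b"); assumes argv[0] is
         * reachable by the spawned processes */
        MPI_Comm_spawn(argv[0], args_a, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter_a, MPI_ERRCODES_IGNORE);
        MPI_Comm_spawn(argv[0], args_b, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter_b, MPI_ERRCODES_IGNORE);
        /* merge with task 1 so that {0,1} form one intra-communicator,
         * with task 0 as rank 0 / local leader */
        MPI_Intercomm_merge(inter_a, 0, &intra);
        /* bridge {0,1} to {2}; inter_b is the peer comm at the leader */
        MPI_Intercomm_create(intra, 0, inter_b, 0, 42, &inter_ab);
    } else if (argc > 1 && 0 == strcmp(argv[1], "a")) {
        /* task 1: merge with the parent, then join as a non-leader
         * (the peer comm is significant only at the local leader) */
        MPI_Intercomm_merge(parent, 1, &intra);
        MPI_Intercomm_create(intra, 0, MPI_COMM_NULL, 0, 42, &inter_ab);
    } else {
        /* task 2: its own MPI_COMM_WORLD is the local group, the parent
         * inter-communicator is the peer comm */
        MPI_Intercomm_create(MPI_COMM_WORLD, 0, parent, 0, 42, &inter_ab);
    }

    MPI_Barrier(inter_ab);  /* exercise the new inter-communicator */
    MPI_Finalize();
    return 0;
}

Task 0 and task 1 first merge their spawn inter-communicator so they form a
single local group, and task 0 (the local leader) then bridges that group to
task 2 over the second spawn inter-communicator.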

MPI_Intercomm_create() calls ompi_comm_get_rprocs(), which in turn calls
ompi_proc_set_locality().

Then, on task 1, ompi_proc_set_locality() calls
opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...), which
fails, and this is OK.
It then calls
opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...), which
also fails, and this is *not* OK.

(On task 2, the first fetch for "task 1" fails but the second one succeeds.)
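
To make the lookup order explicit, here is a hypothetical paraphrase of the
control flow as I understand it (stand-in names and signatures, *not* the
real OPAL dstore API):

/* hypothetical paraphrase of the lookup order, NOT the real OPAL dstore
 * signatures: try the peer store first, then fall back to the non-peer one */
#include <stdbool.h>
#include <stdio.h>

typedef struct { unsigned int jobid, vpid; } name_t;   /* stand-in proc name */

/* stand-in for opal_dstore.fetch(store, &proc->proc_name, <locality key>, ...) */
static bool fetch_locality(const char *store, name_t name, int *locality)
{
    (void)name; (void)locality;
    printf("fetch from %s: miss\n", store);   /* always miss, as on task 1 */
    return false;
}

static int set_locality(name_t name)
{
    int locality = 0;   /* "unknown / non-local" default */

    /* 1) info about peers, published at startup */
    if (fetch_locality("opal_dstore_internal", name, &locality)) return locality;
    /* 2) fallback: info harvested about non-peers (other spawned jobs) */
    if (fetch_locality("opal_dstore_nonpeer", name, &locality)) return locality;
    /* on task 1 both lookups miss for task 2, so proc_flags end up wrong */
    return locality;
}

int main(void)
{
    name_t task2 = { 2, 0 };   /* made-up name for "task 2" */
    printf("locality = %d\n", set_locality(task2));
    return 0;
}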

My analysis is that when task 2 was created, it updated its
opal_dstore_nonpeer with info about "task 1", which had previously been
spawned by task 0.
When task 1 was spawned, task 2 did not exist yet, so task 1's
opal_dstore_nonpeer contains no reference to task 2.
But when task 2 was spawned, task 1's opal_dstore_nonpeer was not
updated, hence the failure.

(On task 1, the proc_flags of task 2 have an incorrect locality; this likely
confuses coll ml and hangs the test.)

Should task 1 have received new information when task 2 was spawned?
Should task 2 have sent information to task 1 when it was spawned?
Should task 1 have (tried to) fetch fresh information before invoking
MPI_Intercomm_create()?

Incidentally, I found that ompi_proc_set_locality calls opal_dstore.store with
the identifier &proc (the argument is &proc->proc_name everywhere else), so
this is likely a bug/typo. The attached patch fixes this.
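
For illustration only, a tiny standalone example of why that matters,
assuming the store keys entries by the 64-bit process name behind the
identifier pointer (made-up types below, *not* the OPAL API): storing under
&proc and fetching under &proc->proc_name can never hit the same key.

/* made-up types, NOT the OPAL API: shows that &proc and &proc->proc_name
 * point at completely different 64-bit keys */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint32_t jobid, vpid; } name_t;
typedef struct { name_t proc_name; int proc_flags; } proc_t;

int main(void)
{
    proc_t p = { { 42u, 1u }, 0 };
    proc_t *proc = &p;
    uint64_t key_buggy, key_right;

    memcpy(&key_buggy, &proc, sizeof(key_buggy));            /* bytes of the pointer itself */
    memcpy(&key_right, &proc->proc_name, sizeof(key_right)); /* bytes of {jobid, vpid} */

    printf("key from &proc            = 0x%016" PRIx64 "\n", key_buggy);
    printf("key from &proc->proc_name = 0x%016" PRIx64 "\n", key_right);
    return 0;
}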

Thanks in advance for your feedback,

Gilles