
Subject: Re: [OMPI devel] some info is not pushed into the dstore
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-27 13:13:02


Hi Gilles

I concur on the typo and have fixed it - thanks for catching it. I'll have to look into the problem you reported, as it has been fixed in the past and was working the last time I checked. The info required for this 3-way connect/accept is supposed to be in the modex provided by the common communicator.

On May 27, 2014, at 3:51 AM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:

> Folks,
>
> While debugging the dynamic/intercomm_create test from the ibm test suite, I found something odd.
>
> I ran *without* any batch manager, on a VM (one socket and four cpus):
> mpirun -np 1 ./dynamic/intercomm_create
>
> It hangs by default; it works with --mca coll ^ml.
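>
> For the record, this is the command line that works around the hang:
> mpirun -np 1 --mca coll ^ml ./dynamic/intercomm_create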
>
> Basically:
> - task 0 spawns task 1
> - task 0 spawns task 2
> - a communicator spanning the 3 tasks is created via MPI_Intercomm_create() (see the sketch below)
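>
> For reference, the scenario has roughly the following shape (a simplified
> sketch only, *not* the actual ibm test source; the tag value, the "b"/"c"
> argv markers and the merge/leader choices are just illustrative):
>
> #include <mpi.h>
> #include <string.h>
>
> int main(int argc, char *argv[])
> {
>     MPI_Comm parent, inter, ab_intra, ac_intra, abc;
>     char *argv_b[] = { "b", NULL }, *argv_c[] = { "c", NULL };
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_get_parent(&parent);
>
>     if (MPI_COMM_NULL == parent) {
>         /* task 0: spawn task 1, then task 2, merging each parent/child
>          * intercommunicator into an intracommunicator */
>         MPI_Comm_spawn(argv[0], argv_b, 1, MPI_INFO_NULL, 0,
>                        MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
>         MPI_Intercomm_merge(inter, 0, &ab_intra);   /* {task 0, task 1} */
>         MPI_Comm_spawn(argv[0], argv_c, 1, MPI_INFO_NULL, 0,
>                        MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
>         MPI_Intercomm_merge(inter, 0, &ac_intra);   /* {task 0, task 2} */
>         /* bridge {task 0, task 1} with {task 2}: task 0 is the local
>          * leader and reaches task 2 (rank 1) through ac_intra */
>         MPI_Intercomm_create(ab_intra, 0, ac_intra, 1, 201, &abc);
>     } else if (0 == strcmp(argv[1], "b")) {
>         /* task 1: part of the {task 0, task 1} group; it has never
>          * exchanged anything with task 2 at this point */
>         MPI_Intercomm_merge(parent, 1, &ab_intra);
>         MPI_Intercomm_create(ab_intra, 0, MPI_COMM_SELF, 0, 201, &abc);
>     } else {
>         /* task 2: its own singleton group, bridged to {task 0, task 1} */
>         MPI_Intercomm_merge(parent, 1, &ac_intra);
>         MPI_Intercomm_create(MPI_COMM_WORLD, 0, ac_intra, 0, 201, &abc);
>     }
>
>     MPI_Barrier(abc);   /* collectives over abc are where coll ml kicks in */
>     MPI_Finalize();
>     return 0;
> }
>
> The important point is that task 1 only ever learns about task 2 inside MPI_Intercomm_create() itself.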
>
> MPI_Intercomm_create() calls ompi_comm_get_rprocs() which calls ompi_proc_set_locality()
>
> Then, on task 1, ompi_proc_set_locality() calls
> opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...), which fails, and this is OK;
> it then calls
> opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...), which also fails, and this is *not* OK.
>
> /* on task 2, the first fetch for "task 1" fails but the second one succeeds */
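>
> In code terms, the lookup is a two-level fallback, roughly like below
> (paraphrased from memory, *not* a verbatim copy of ompi_proc_set_locality();
> "key" and "vals" are just stand-ins for the locality key and the output
> argument, and the exact signatures may differ):
>
>     /* first look in the store describing procs of jobs we already know about */
>     rc = opal_dstore.fetch(opal_dstore_internal, &proc->proc_name, key, &vals);
>     if (OPAL_SUCCESS != rc) {
>         /* fall back to the store describing procs from other jobs
>          * (connect/accept, spawn) */
>         rc = opal_dstore.fetch(opal_dstore_nonpeer, &proc->proc_name, key, &vals);
>     }
>     /* on task 1 both lookups fail for "task 2", so no locality info is found */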
>
> My analysis is that when task 2 was created, it updated its opal_dstore_nonpeer with info about "task 1", which had previously been spawned by task 0.
> When task 1 was spawned, task 2 did not exist yet, so its opal_dstore_nonpeer contains no reference to task 2;
> and when task 2 was later spawned, the opal_dstore_nonpeer of task 1 was not updated, hence the failure.
>
> (on task 1, the proc_flags of task 2 therefore carry an incorrect locality; this likely confuses coll ml and hangs the test)
>
> Should task 1 have received new information when task 2 was spawned?
> Should task 2 have sent information to task 1 when it was spawned?
> Should task 1 have (tried to) fetch fresh information before invoking MPI_Intercomm_create()?
>
> Incidentally, I found that ompi_proc_set_locality calls opal_dstore.store with
> identifier &proc (the argument is &proc->proc_name everywhere else), so this
> is likely a bug/typo. The attached patch fixes this.
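>
> In other words, the change is along these lines (paraphrasing the description
> above rather than quoting the attached patch; the other arguments are elided
> and unchanged):
>
>     /* before: the identifier is the address of the proc pointer itself */
>     opal_dstore.store(..., &proc, ...);
>     /* after: the identifier is the proc name, as in every other caller */
>     opal_dstore.store(..., &proc->proc_name, ...);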
>
> Thanks in advance for your feedback,
>
> Gilles
> <proc.patch>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14848.php