Responding to both of Ralph's e-mails in one, just to confuse people :).
First, the issue of the recursive locks... Back in the day, ompi_proc_t
instances could be created as a side effect of other operations.
Therefore, to maintain sanity, the procs were implicitly added to the
master proc list when the proc was created.
When I reworked the modex last year, I also changed the proc code so that
procs are never implicitly created. There's only two ways to get a new
proc to come into existence -- ompi_proc_init() and ompi_proc_unpack().
Therefore, there's no need for the proc construct to implicitly add the
proc to the procs list -- there's only two places in the proc code that
need to change if it isn't (solving the recusive locking problem).
Ralph already committed code to the trunk to deal with this case, although
there are a couple of changes that Jeff and Ralph are working out. For
removal, it ended up making more sense to remove the procs from the list
during the destructor, as it will allow us to remove procs from the system
in comm_disconnect situations in the future (although we're not actually
going to do that until after v1.3).
As for thread safety with the return from ompi_proc_all and
ompi_proc_world. Today, the ompi_proc code has one ref count held for it,
which is set in ompi_proc_init or ompi_proc_unpack and not released until
ompi_proc_finalize. So there's no way to remove a proc from the system
before MPI_FINALIZE, unless something bad has already happened.
Therefore, the lists are (today) thread safe. ompi_proc_world and
ompi_proc_self only return procs created by ompi_proc_init, and there's no
way a correct MPI program could ever cause the procs in that list to reach
a ref count of zero before MPI_FINALIZE, so they are always thread safe in
There is a problem which I've outlined to Jeff and Ralph with the thread
safety of ompi_proc_all() if the accept/connect code is changed such that
procs can be destroyed before ompi_proc_finalize(). This can't happen
today, so there is no thread safety issue with ompi_proc_all() today. In
the future, there might be, although the solution is fairly simple (and
Jeff and Ralph know what I propsed and will implement it before implentng
the release procs before finalize thing, which has other implications to
the PML/BML/BTL that I won't go into here.
On Mon, 7 Jul 2008, Ralph H Castain wrote:
> I will have to correct something here. From what I can see, it appears the
> MPI code may not be creating ompi_proc_t structures, but rather creating
> arrays of ompi_proc_t* pointers that are then filled with values equal to
> the pointers in the ompi_proc_list held inside of ompi/proc/proc.c.
> It appears, though, that this may be done in a non-thread-safe manner. When
> the arrays are filled by calling ompi_proc_world or ompi_proc_all, the
> objects themselves are never OBJ_RETAIN'd. As a result, when the first
> thread in the code calls OBJ_RELEASE, the object is removed from the
> ompi_proc_list and free'd - but the other threads that called
> ompi_proc_world/all still retain pointers that now reference invalid memory.
> Perhaps the best path forward is to devise a thread-safe design for this
> code area and present it for people's review. I'll see what I can do.
> Again, comments are welcomed
> On 7/7/08 8:22 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
>> I am seeking input before making a change to the ompi/proc/proc.c code to
>> resolve the referenced ticket. The change could potentially impact how the
>> ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't
>> impact you, please ignore the remainder of this note.
>> I was asked last week to take a look at ticket #1267, filed by Tim Prins
>> several months ago. This ticket references a report on the devel list about
>> thread locks when calling comm_spawn and using MPI_Init_thread. The thread
>> lock is caused by the constructor/destructor in the ompi_proc_t class
>> explicitly removing the referenced ompi_proc_t from the static local global
>> ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that
>> list operation.
>> As far as I can see, Tim correctly resolved the constructor conflict by
>> simply removing the thread lock/unlock and list append operation from the
>> constructor. A scan of the code shows that OBJ_NEW is -only- called from
>> within the ompi/proc/proc.c code, so this won't be an issue.
>> However, I noted several issues surrounding the creation and release of
>> ompi_proc_t objects that -may- cause problems in making a similar change to
>> the destructor to fix the rest of the threading problem. These may have been
>> created in response to the list modification code currently existing in the
>> ompi_proc_t object destructor - or they may be due to other factors.
>> Specifically, the MPI code base outside of ompi/proc/proc.c:
>> 1. -never- treats the ompi_proc_t as an opal object. Instead, the code
>> specifically calls calloc to create space for the structures, and then
>> manually constructs them.
>> 2. consistently calls OBJ_RELEASE on the resulting structures, even though
>> they were never created as opal objects via OBJ_NEW.
>> I confess to being puzzled here as the destructor will attempt to remove the
>> referenced ompi_proc_t* pointer from the ompi_proc_list in ompi/proc/proc.c,
>> but (since OBJ_NEW wasn't called) that object won't be on the list. Looking
>> at the code itself, it appears we rely on the list function to realize that
>> this object isn't on the list and ignore the request to remove it. We don't
>> check the return code from the opal_list_remove_item, and so ignore the
>> returned error.
>> My point here is to seek comment about the proposed fix for the problem
>> referenced in the ticket. My proposal is to remove the thread lock/unlock
>> and list manipulation from the ompi_proc_t destructor. From what I can see
>> (as described above), this should not impact the rest of the code base. I
>> will then add thread lock/unlock protection explicitly to ompi_proc_finalize
>> to protect its list operations.
>> It appears to me that this change would also open the way to allowing the
>> remainder of the code base to treat ompi_proc_t as an object, using OBJ_NEW
>> to correctly and consistently initialize those objects. I note that any
>> change to ompi_proc_t today (which has occurred in the not-too-distant
>> past!) can create a problem throughout the current code base due to the
>> manual construction of this object. This is why we have objects in the first
>> place - I suspect people didn't use OBJ_NEW because of the automatic change
>> it induced in the ompi_proc_list in ompi/proc/proc.c.
>> Any comments would be welcome.
>> devel mailing list
> devel mailing list