I will have to correct something here. From what I can see, it appears the
MPI code may not be creating ompi_proc_t structures, but rather creating
arrays of ompi_proc_t* pointers that are then filled with values equal to
the pointers in the ompi_proc_list held inside of ompi/proc/proc.c.
It appears, though, that this may be done in a non-thread-safe manner. When
the arrays are filled by calling ompi_proc_world or ompi_proc_all, the
objects themselves are never OBJ_RETAIN'd. As a result, when the first
thread in the code calls OBJ_RELEASE, the object is removed from the
ompi_proc_list and free'd - but the other threads that called
ompi_proc_world/all still retain pointers that now reference invalid memory.
Perhaps the best path forward is to devise a thread-safe design for this
code area and present it for people's review. I'll see what I can do.
Again, comments are welcomed
On 7/7/08 8:22 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
> I am seeking input before making a change to the ompi/proc/proc.c code to
> resolve the referenced ticket. The change could potentially impact how the
> ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't
> impact you, please ignore the remainder of this note.
> I was asked last week to take a look at ticket #1267, filed by Tim Prins
> several months ago. This ticket references a report on the devel list about
> thread locks when calling comm_spawn and using MPI_Init_thread. The thread
> lock is caused by the constructor/destructor in the ompi_proc_t class
> explicitly removing the referenced ompi_proc_t from the static local global
> ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that
> list operation.
> As far as I can see, Tim correctly resolved the constructor conflict by
> simply removing the thread lock/unlock and list append operation from the
> constructor. A scan of the code shows that OBJ_NEW is -only- called from
> within the ompi/proc/proc.c code, so this won't be an issue.
> However, I noted several issues surrounding the creation and release of
> ompi_proc_t objects that -may- cause problems in making a similar change to
> the destructor to fix the rest of the threading problem. These may have been
> created in response to the list modification code currently existing in the
> ompi_proc_t object destructor - or they may be due to other factors.
> Specifically, the MPI code base outside of ompi/proc/proc.c:
> 1. -never- treats the ompi_proc_t as an opal object. Instead, the code
> specifically calls calloc to create space for the structures, and then
> manually constructs them.
> 2. consistently calls OBJ_RELEASE on the resulting structures, even though
> they were never created as opal objects via OBJ_NEW.
> I confess to being puzzled here as the destructor will attempt to remove the
> referenced ompi_proc_t* pointer from the ompi_proc_list in ompi/proc/proc.c,
> but (since OBJ_NEW wasn't called) that object won't be on the list. Looking
> at the code itself, it appears we rely on the list function to realize that
> this object isn't on the list and ignore the request to remove it. We don't
> check the return code from the opal_list_remove_item, and so ignore the
> returned error.
> My point here is to seek comment about the proposed fix for the problem
> referenced in the ticket. My proposal is to remove the thread lock/unlock
> and list manipulation from the ompi_proc_t destructor. From what I can see
> (as described above), this should not impact the rest of the code base. I
> will then add thread lock/unlock protection explicitly to ompi_proc_finalize
> to protect its list operations.
> It appears to me that this change would also open the way to allowing the
> remainder of the code base to treat ompi_proc_t as an object, using OBJ_NEW
> to correctly and consistently initialize those objects. I note that any
> change to ompi_proc_t today (which has occurred in the not-too-distant
> past!) can create a problem throughout the current code base due to the
> manual construction of this object. This is why we have objects in the first
> place - I suspect people didn't use OBJ_NEW because of the automatic change
> it induced in the ompi_proc_list in ompi/proc/proc.c.
> Any comments would be welcome.
> devel mailing list