On Oct 11, 2013, at 4:07 AM, George Bosilca <bosilca@icl.utk.edu> wrote:

On Oct 9, 2013, at 15:29 , Ralph Castain <rhc@open-mpi.org> wrote:

IIRC, the concern was with where the thread safety should reside. Some classes (e.g., opal_list) were littered with thread locks for every operation. So if someone implemented thread protection at a higher level (e.g., protecting the list while cycling thru it), then all these lower-level lock/unlock operations were just a waste of cycles.

I tried to find these protections in the basic objects (pal_list_t as you named it) but I failed. I don't see this being the case in any of the versions out there (1.6, 1.7 nor trunk). There are some atomic operations to keep track of the ref counts, but this is a completely different topic.

In the OMPI layer we tried to follow the rule that all calls without capitals are not thread safe (and are functions), while all calls with capitals at macros and are protected. This was a best effort applied where it made sense.

Only one I could find that has been renamed is ompi_free_list_resize, which has been renamed to ompi_free_list_resize_mt as it includes a lock/unlock in it. However, there are many places in the opal and ompi classes where thread locks are being called - this is what we seek to remove.

opal/class/opal_pointer_array.c:45:    OBJ_CONSTRUCT(&array->lock, opal_mutex_t);
opal/class/opal_pointer_array.c:67:    OBJ_DESTRUCT(&array->lock);
opal/class/opal_pointer_array.c:113:    OPAL_THREAD_LOCK(&(table->lock));
opal/class/opal_pointer_array.c:120:            OPAL_THREAD_UNLOCK(&(table->lock));
opal/class/opal_pointer_array.c:149:    OPAL_THREAD_UNLOCK(&(table->lock));
opal/class/opal_pointer_array.c:171:    OPAL_THREAD_LOCK(&(table->lock));
opal/class/opal_pointer_array.c:175:            OPAL_THREAD_UNLOCK(&(table->lock));
opal/class/opal_pointer_array.c:215:    OPAL_THREAD_UNLOCK(&(table->lock));
opal/class/opal_pointer_array.c:248:    OPAL_THREAD_LOCK(&(table->lock));
opal/class/opal_pointer_array.c:251:        OPAL_THREAD_UNLOCK(&(table->lock));
opal/class/opal_pointer_array.c:260:            OPAL_THREAD_UNLOCK(&(table->lock));
opal/class/opal_pointer_array.c:291:    OPAL_THREAD_UNLOCK(&(table->lock));
opal/class/opal_pointer_array.c:297:    OPAL_THREAD_LOCK(&(array->lock));
opal/class/opal_pointer_array.c:300:            OPAL_THREAD_UNLOCK(&(array->lock));
opal/class/opal_pointer_array.c:304:    OPAL_THREAD_UNLOCK(&(array->lock));

However, some people felt that there were places where it helped to have the locking down below. So this was the compromise - use the version that fits your situation.

In most of the cases there is nothing better we can do down than protecting the call itself. 

Personally, I'm not wild about it, but I can live with it. I'd prefer to see no lock/unlock calls in the classes themselves as they are too atomistic, and would have opted for providing a macro version of the function that included the appropriate lock/unlocks around the function.

I'm 100% with you here, I also prefer to see the locks, as this makes errors easier to spot. This is why I'm concerned about moving them outside the view, buried under several levels of macro indirections. I could understand the push if there was an obvious performance or safety benefit, but as I fail to see I was wondering if I was missing something from the "bigger" picture.

Here's how I recollect the discussion. There are thread locks down in many of the opal classes - the opal_pointer_array and opal_list functions have embedded lock/unlock in their operations, and I believe others do too. We talked about our desired threading model and agreed that this was too low down in the stack. For example, looping over an opal_list shouldn't invoke a thread lock/unlock for every opal_list_get_next call - we can just lock the loop and avoid all the performance hit. So we agreed on a higher-level thread protection model where we lock up above where the calls are being made.

However, someone pointed out that there might be times when locking at the lower level made sense. So we agreed that any function that actually might benefit from internal thread protection would have two variants: _mt that had the locks, and _st that did not. I think the rationale against the macro definition is that the lock might occur inside the function - e.g., there might be a conditional branch in the function that required the lock, but not the entire function. Putting the lock around the function would force it to always occur, which letting the lock be inside the function would avoid it unless necessary.

Hence the work plan in the meeting minutes. Someone would go into those classes and make two copies of each class. The _st version would have all its lock/unlock calls removed, while the _mt version would retain them. We would then go into the places where those classes are called and either replace the existing call with _mt (since that was the prior functionality), or introduce a lock around the code and use the _st version (the preferred solution).

The plan eventually is to eliminate the _mt versions except where they specifically are required. Personally, I would expect that to be pretty rare, and so I would prefer an approach that doesn't add _st all over the place - our function names are long enough already!



Anyway, that was the thinking at the meeting last June.

On Oct 9, 2013, at 1:40 AM, George Bosilca <bosilca@icl.utk.edu> wrote:

My concern is that increasing the number of interfaces will not make the code thread safe, as in most cases thread safety is not only a matter of using a _mt version of the basic class object but a matter of a careful manipulation of higher level concepts.

We can hardly use the lack of the _MT function as a reason for not having thread safety in the code. We did have the thread safety a while back without the support of _MT version of all the basic classes. 

So I really wonder what is the rationale behind such an intrusive change of the codebase?


On Oct 8, 2013, at 18:14 , Ralph Castain <rhc@open-mpi.org> wrote:

Hi folks

This was one item from the last devel meeting that can be done independent of other things:

resolution: all opal and orte (and possibly ompi) classes need to have a thread safe and thread-free interface
_st suffix: single thread (i.e., not thread safe variant)
_mt suffix: multi thread (i.e., thread safe variant)
for functions that have both st/mt, they will *both* have suffixes
other functions (that do not have st/mt versions) will be naked names
need to rename all classes that have locking enabled already (e.g., opal_free_list)
so today, we go rename all the functions (e.g., opal_free_list functions get _mt suffix) throughout the code base
as someone needs the _st version, they go create it and use it as they want to
Ralph will do the orte classes
Aurelien will do this for the ompi classes

I believe some of these have been done - I will take care of the ORTE classes this week, so consider this a "heads up" for that change.

devel mailing list

devel mailing list

devel mailing list

devel mailing list