Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] RFC: assert() to ensure OBJ_CONSTRUCT'ed objects don't get destroyed
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-03-07 20:19:04


On Mar 7, 2013, at 4:37 PM, George Bosilca <bosilca_at_[hidden]> wrote:

> Please refrain from doing so, the assumption #1 this patch is based on is false. First, OBJ_CONSTRUCT can be run to construct a specific type of object in a preallocated memory region (not only on the stack or heap). In fact, it is the only way we can dynamically initialize an object in a memory allocated outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not necessarily matched with an OBJ_DESTRUCT, it work just fine with OBJ_RELEASE. In fact I use these feature in several places.
>
> An example will be a memory region without a predefined size, that I manipulate as opal_list_item_t. This fragment gets allocated when it's size is know, then gets OBJ_CONSTRUCT'ed and then used. The reference count is playing its role, when nobody is using the object anymore, it will be automatically released. With the change you propose such usage will be prohibited.
>
> The feature you are looking for, the one that might have shorten Ralph's debugging time, is already in the opal_object_t. One should use the cls_init_file_name and cls_init_lineno fields to see where the object was first initialized as these fields are set either by the OBJ_NEW or by the OBJ_CONSTRUCT call.

Not exactly. Consider the case where we have a library - e.g., ORTE. Down in the library, perhaps several function calls down, we receive a pointer to an object. The library, not knowing any better, uses OBJ_RETAIN to indicate that this object is being used and therefore should not be released. It then returns and allows an async procedure to run.

Up above, one caller to the library uses OBJ_NEW to create the object. Thus, the reference count system is in-play and governs when the data goes away.

However, in another place, the caller uses OBJ_CONSTRUCT to initialize the object, and OBJ_DESTRUCTs it when the high-level call returns. In this case, the reference count system is ignored - OBJ_DESTRUCT destroys the object regardless of the reference count. So suddenly the library is working with garbage, with no way to know that it happened.

So now let's be specific to see how your suggestion doesn't solve the problem (I actually had tried it). Consider the OOB operating asynchronously. In the first case, where the opal_buffer_t is created via OBJ_NEW, we can point the message system at the data field in the buffer and just OBJ_RETAIN it.

However, in the second case, the OBJ_RETAIN won't work - the calling code releases the data area, but the OOB has no idea that happened. So the pointer to the data field actually isn't valid any more, but there is no way to detect it. Likewise, looking for opal_object_t info is useless as the fields no longer exist.

Yes, I realize this is an incorrect program - the caller isn't allowed to release data until the async operation concludes. But the fact that a library function, especially one that is hidden down low in the code base, is async may not be immediately apparent. The resulting "bug" is extremely hard to chase down, especially as it almost inevitably is exposed as a race condition.

The goal of this RFC was to make such problems immediately apparent. Perhaps one solution would be to get rid of OBJ_DESTRUCT and just have everyone use OBJ_RELEASE, augmented with a flag to indicate whether or not to free the backing memory. Another might be to have OBJ_DESTRUCT respect ref counts, but Jeff, Brian and I didn't like that one. The proposed solution was another way to approach it that would force the above scenario to be recognized only during debug builds, thus allowing it to be identified and corrected.

Your example use-case is certainly something I hadn't considered - so perhaps we'll have to think of another way to detect my situation while still allowing what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN (or whatever) macro for use in your use-case that corrects the ref count?

>
> George.
>
> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but reasonable.
>
> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>
>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an object and its ref count goes to 0 (in debug builds only).
>>
>> WHY: To catch bugs.
>>
>> WHERE: opal/class/opal_class.h
>>
>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013
>>
>> MORE DETAIL:
>>
>> On the call on Tuesday, we talked about some ideas for catching bugs with respect to object reference counting. After the call, Brian, Ralph, and I came up with two useful asserts to help catch bugs (in debug builds only):
>>
>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero.
>> 2. When a object is destroyed, its refcount should be exactly 1.
>>
>> This RFC is for #1. The attached patch doesn't seem to cause any problems (and we didn't expect it to). But it's a good addition to the other asserts() that are already in the object code already.
>>
>> As for #2, Ralph has previously found bugs in the ORTE layer that would have been much easier to track down if #2 were in place. I'll send an RFC for #2 when I have managed to fix all the problems that it has found in the OMPI layer... :-)
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> Index: opal/class/opal_object.h
>> ===================================================================
>> --- opal/class/opal_object.h (revision 28147)
>> +++ opal/class/opal_object.h (working copy)
>> @@ -169,7 +169,7 @@
>> * @param NAME Name of the class to initialize
>> */
>> #if OPAL_ENABLE_DEBUG
>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>> #else
>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 }
>> #endif
>> @@ -184,6 +184,10 @@
>> /** Magic ID -- want this to be the very first item in the
>> struct's memory */
>> uint64_t obj_magic_id;
>> + /* flag whether this was initialized using construct
>> + * versus obj_new
>> + */
>> + bool constructed;
>> #endif
>> opal_class_t *obj_class; /**< class descriptor */
>> volatile int32_t obj_reference_count; /**< reference count */
>> @@ -252,6 +256,7 @@
>> object->obj_magic_id = OPAL_OBJ_MAGIC_ID;
>> object->cls_init_file_name = file;
>> object->cls_init_lineno = line;
>> + object->constructed = false;
>> return object;
>> }
>> #define OBJ_NEW(type) \
>> @@ -313,6 +318,8 @@
>> assert(NULL != ((opal_object_t *) (object))->obj_class); \
>> assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
>> if (0 == opal_obj_update((opal_object_t *) (object), -1)) { \
>> + /* constructed objects are not allowed to reach 0 */ \
>> + assert(!(((opal_object_t *) (object))->constructed)); \
>> OBJ_SET_MAGIC_ID((object), 0); \
>> opal_obj_run_destructors((opal_object_t *) (object)); \
>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>> @@ -344,6 +351,7 @@
>> OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type)); \
>> } while (0)
>>
>> +#if OPAL_ENABLE_DEBUG
>> #define OBJ_CONSTRUCT_INTERNAL(object, type) \
>> do { \
>> OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \
>> @@ -352,11 +360,24 @@
>> } \
>> ((opal_object_t *) (object))->obj_class = (type); \
>> ((opal_object_t *) (object))->obj_reference_count = 1; \
>> + ((opal_object_t *) (object))->constructed = true; \
>> opal_obj_run_constructors((opal_object_t *) (object)); \
>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>> } while (0)
>> +#else
>> +#define OBJ_CONSTRUCT_INTERNAL(object, type) \
>> +do { \
>> + OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \
>> + if (0 == (type)->cls_initialized) { \
>> + opal_class_initialize((type)); \
>> + } \
>> + ((opal_object_t *) (object))->obj_class = (type); \
>> + ((opal_object_t *) (object))->obj_reference_count = 1; \
>> + opal_obj_run_constructors((opal_object_t *) (object)); \
>> + OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>> +} while (0)
>> +#endif
>>
>> -
>> /**
>> * Destruct (finalize) an object that is not dynamically allocated.
>> *
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel