This assertion problem is now solved by a patch in ROMIO just committed in

I don't know of any other problems in this port of ROMIO.


Pascal Deveze wrote:
Jeff Squyres wrote:
On Dec 16, 2010, at 3:31 AM, Pascal Deveze wrote:

int main(int argc, char **argv) {
  MPI_File fh;
  MPI_Info info, info_used;





I run this program on one process: salloc -p debug -n1 mpirun -np 1 ./a.out
And I get the assertion error:

a.out: attribute/attribute.c:763: ompi_attr_delete: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (keyval))->obj_magic_id' failed.
[cuzco10:24785] *** Process received signal ***
[cuzco10:24785] Signal: Aborted (6)


I saw that there is a problem with an MPI_COMM_SELF communicator.

The problem disappears (and all ROMIO tests are OK) when I comment out line 89 in the file ompi/mca/io/romio/romio/adio/common/ad_close.c :
     // MPI_Comm_free(&(fd->comm));

The problem disappears (and all ROMIO tests are OK) when I comment out line 425 in the file ompi/mca/io/romio/romio/adio/common/cb_config_list.c :
   //  MPI_Keyval_free(&keyval);

The problem also disappears (but only 50% of the ROMIO tests are OK) when I comment out line 133 in the file ompi/runtime/ompi_mpi_finalize.c:
      // ompi_attr_delete_all(COMM_ATTR, &ompi_mpi_comm_self,
     //                             ompi_mpi_comm_self.comm.c_keyhash);

It sounds like there's a problem with the ordering of shutdown of things in MPI_FINALIZE w.r.t. ROMIO.

FWIW: ROMIO violates some of our abstractions, but it's the price we pay for using a 3rd party package.  One very, very important abstraction that we have is that top-level MPI API functions are not allowed to call any other MPI API functions.  E.g., MPI_Send (i.e., ompi/mpi/c/send.c) cannot call MPI_Isend (i.e., ompi/mpi/c/isend.c).  MPI_Send *can* call the same back-end implementation functions that MPI_Isend does -- it's just not allowed to call MPI_<foo>.

The reason is that the top-level MPI API functions do things like check for whether MPI_INIT / MPI_FINALIZE have been called, etc.  The back-end functions do not do this.  Additionally, top-level MPI API functions may be overridden via PMPI kinds of things.  We wouldn't want our internal library calls to get intercepted by user code.
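To make the layering concrete, here is a minimal toy sketch in C.  Every name in it (Toy_Send, Toy_Isend, toy_backend_isend, toy_init, etc.) is hypothetical and invented for illustration; none of this is actual Open MPI source.  It only shows the shape of the rule: each front-end function does its own "has init been called?" check and then calls the shared back end, never another front-end function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is NOT Open MPI source code. */
enum { TOY_SUCCESS = 0, TOY_ERR_NOT_INITIALIZED = 1 };

static bool toy_initialized = false;  /* toggled by toy_init / toy_finalize */
static int  toy_backend_calls = 0;    /* counts how often the back end ran */

void toy_init(void)     { toy_initialized = true; }
void toy_finalize(void) { toy_initialized = false; }
int  toy_backend_call_count(void) { return toy_backend_calls; }

/* Back end: does the work and performs no init/finalize checks.
   It trusts its callers, so it only asserts on its arguments. */
int toy_backend_isend(const void *buf, size_t len)
{
    assert(buf != NULL);
    (void) len;
    ++toy_backend_calls;
    return TOY_SUCCESS;
}

/* Front end: checks library state, then calls the back end. */
int Toy_Isend(const void *buf, size_t len)
{
    if (!toy_initialized) {
        return TOY_ERR_NOT_INITIALIZED;
    }
    return toy_backend_isend(buf, len);
}

/* Front end: also calls the back end directly.  It must NOT call
   Toy_Isend, because that symbol could be intercepted by user code
   (the PMPI case) and would repeat the state checks. */
int Toy_Send(const void *buf, size_t len)
{
    if (!toy_initialized) {
        return TOY_ERR_NOT_INITIALIZED;
    }
    return toy_backend_isend(buf, len);
}
```

In this sketch, calling Toy_Send before toy_init returns TOY_ERR_NOT_INITIALIZED without ever reaching the back end; after toy_init, both front-end functions succeed and share the same back-end code path.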

I am not very familiar with the OBJ_RELEASE/OBJ_RETAIN mechanism, and so far I do not understand the real origin of the problem.

RETAIN/RELEASE is part of OMPI's "poor man's C++" design.  Waaaay back in the beginning of the project, we debated whether to use C or C++ for developing the code.  There was a desire to use some of the basic object functionality of C++ (e.g., derived classes, constructors, destructors, etc.), but we wanted to stay as portable as possible.  So we ended up going with C, but with a few macros that emulate some C++-like functionality.  This led to OMPI's OBJ system that is used all over the place.  

The OBJ system does several things:

- allows you to have "constructor"- and "destructor"-like behavior for structs
- works for both stack and heap memory
- reference counting

The reference counting is perhaps the most-used function of OBJ.  Here's a sample scenario:

/* allocate some memory, call the some_object_type "constructor",
   and set the reference count of "foo" to 1 */
foo = OBJ_NEW(some_object_type);

/* increment the reference count of foo (to 2) */
OBJ_RETAIN(foo);

/* increment the reference count of foo (to 3) */
OBJ_RETAIN(foo);

/* decrement the reference count of foo (to 2) */
OBJ_RELEASE(foo);

/* decrement the reference count of foo (to 1) */
OBJ_RELEASE(foo);

/* decrement the reference count of foo to 0 -- which will
   call foo's "destructor" and then free the memory */
OBJ_RELEASE(foo);
The same principle works for structs on the stack -- we do the same constructor / destructor behavior, but just don't free the memory.  For example:

/* Instantiate the memory and call its "constructor" and set the
   ref count to 1 */
some_object_type foo;
OBJ_CONSTRUCT(&foo, some_object_type);

/* Increment and decrement the ref count */

/* The last RELEASE will call the destructor, but won't actually
   free the memory, because the memory was not allocated with 
   OBJ_NEW */

When the destructor is called, the OBJ system sets the magic number in the obj's memory to a sentinel value so that we know that the destructor has been called on this particular struct.  Hence, if we call OBJ_RELEASE *again* on a struct that has already had its ref count go to 0 (and therefore already had its destructor called), we get the assert error that you're seeing.

So to be totally clear: the assert error you're seeing is because some OBJ is (effectively) getting its ref count decremented below zero.  Which means it's trying to get destroyed twice.  Which means the ordering sequence of stuff in the ROMIO shutdown / MPI_FINALIZE is likely not right.
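Here is a small self-contained toy in C that mimics the magic-number check, in case it helps to see the mechanism in isolation.  It is only a sketch under assumptions: the toy_obj_* names are hypothetical, the "dead" sentinel value of 0 is my assumption, and the real OBJ macros differ in detail.  The "live" magic value is the one that appears in the assertion message above.  Unlike the real code, which asserts, this toy's release function returns -1 on a second destroy so that the behavior can be observed without aborting.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy types; the real OBJ system differs in detail. */
#define TOY_MAGIC_LIVE ((UINT64_C(0xdeafbeed) << 32) + UINT64_C(0xdeafbeed))
#define TOY_MAGIC_DEAD UINT64_C(0)   /* assumed sentinel value */

typedef struct toy_obj {
    uint64_t magic;            /* TOY_MAGIC_LIVE while the object is valid */
    int      refcount;
    int      destructor_runs;  /* how many times the "destructor" ran */
} toy_obj_t;

/* "Constructor": mark the object live and set the ref count to 1. */
void toy_obj_construct(toy_obj_t *obj)
{
    obj->magic = TOY_MAGIC_LIVE;
    obj->refcount = 1;
    obj->destructor_runs = 0;
}

/* Increment the ref count; only legal on a live object. */
void toy_obj_retain(toy_obj_t *obj)
{
    assert(obj->magic == TOY_MAGIC_LIVE);
    ++obj->refcount;
}

/* Decrement the ref count.  When it reaches 0, run the "destructor"
   and overwrite the magic number.  Returns 0 on success, or -1 if the
   object was already destroyed (the real code asserts at this point). */
int toy_obj_release(toy_obj_t *obj)
{
    if (obj->magic != TOY_MAGIC_LIVE) {
        return -1;
    }
    if (--obj->refcount == 0) {
        ++obj->destructor_runs;
        obj->magic = TOY_MAGIC_DEAD;
    }
    return 0;
}
```

The toy deliberately works on caller-owned memory (e.g., a struct on the stack) and never frees it, so reading the magic number after destruction is well-defined here.  A release too many is then caught by the magic-number check instead of running the destructor a second time.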
