On Apr 4, 2011, at 8:18 AM, Rob Latham wrote:
> On Sat, Apr 02, 2011 at 04:59:34PM -0400, fah10_at_[hidden] wrote:
>> opal_mutex_lock(): Resource deadlock avoided
>> #0 0x0012e416 in __kernel_vsyscall ()
>> #1 0x01035941 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>> #2 0x01038e42 in abort () at abort.c:92
>> #3 0x00d9da68 in ompi_attr_free_keyval (type=COMM_ATTR, key=0xbffda0e4, predefined=0 '\000') at attribute/attribute.c:656
>> #4 0x00dd8aa2 in PMPI_Keyval_free (keyval=0xbffda0e4) at pkeyval_free.c:52
>> #5 0x01bf3e6a in ADIOI_End_call (comm=0xf1c0c0, keyval=10, attribute_val=0x0, extra_state=0x0) at ad_end.c:82
>> #6 0x00da01bb in ompi_attr_delete. (type=UNUSED_ATTR, object=0x6, attr_hash=0x2c64, key=14285602, predefined=232 '\350', need_lock=128 '\200') at attribute/attribute.c:726
>> #7 0x00d9fb22 in ompi_attr_delete_all (type=COMM_ATTR, object=0xf1c0c0, attr_hash=0x8d0fee8) at attribute/attribute.c:1043
>> #8 0x00dbda65 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:133
>> #9 0x00dd12c2 in PMPI_Finalize () at pfinalize.c:46
>> #10 0x00d6b515 in mpi_finalize_f (ierr=0xbffda2b8) at pfinalize_f.c:62
> I guess I need some OpenMPI eyeballs on this...
> ROMIO hooks into the attribute keyval deletion mechanism to clean up
> the internal data structures it has allocated. I suppose since this
> is MPI_Finalize, we could just leave those internal data structures
> alone and let the OS deal with it.
> What I see happening here is the OpenMPI finalize routine is deleting
> attributes. One of those attributes is ROMIO's, which in turn tries
> to free keyvals. Is the deadlock that nothing "under" ompi_attr_delete
> can itself call ompi_* routines (as ROMIO does when it triggers a call
> to ompi_attr_free_keyval)?
> Here's where ROMIO sets up the keyval and the delete handler:
> that routine gets called upon any "MPI-IO entry point" (open, delete,
> register-datarep). The keyvals help ensure that ROMIO's internal
> structures get initialized exactly once, and the delete hooks help us
> be good citizens and clean up on exit.
FWIW: his trace shows that OMPI incorrectly attempts to acquire a thread lock that has already been locked. This occurs in OMPI's attribute code, probably surrounding the call to your code.
In other words, it looks to me like the problem is on our side, not yours. Jeff is the one who generally handles the attribute code, though, so I'll ping his eyeballs :-)
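
A minimal sketch of this failure mode, for illustration only. This is not OMPI's attribute code: the names attr_lock, free_keyval_sketch, delete_all_sketch and demo_relock are hypothetical, and the sketch assumes the lock behaves like a POSIX error-checking mutex, which reports a relock by the owning thread as EDEADLK instead of hanging. The "delete all" path holds the lock while it runs the delete callback, and the callback then calls back into a routine that takes the same lock:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the lock guarding the attribute tables. */
static pthread_mutex_t attr_lock;

/* Create the lock as an error-checking mutex (an assumption of this
 * sketch), so a relock by the owner returns EDEADLK rather than blocking. */
static int init_attr_lock(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) rc = pthread_mutex_init(&attr_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Plays the role of the keyval-free path: it takes the attribute lock. */
static int free_keyval_sketch(void)
{
    int rc = pthread_mutex_lock(&attr_lock);
    if (rc != 0) return rc;  /* EDEADLK if this thread already holds it */
    pthread_mutex_unlock(&attr_lock);
    return 0;
}

/* Plays the role of the delete-all path: it holds the lock while the
 * user-supplied delete callback runs. */
static int delete_all_sketch(int (*delete_cb)(void))
{
    int rc = pthread_mutex_lock(&attr_lock);
    if (rc != 0) return rc;
    rc = delete_cb();        /* the callback re-enters the locked code */
    pthread_mutex_unlock(&attr_lock);
    return rc;
}

/* Runs the scenario once and returns the callback's result. */
int demo_relock(void)
{
    int rc = init_attr_lock();
    if (rc != 0) return rc;
    rc = delete_all_sketch(free_keyval_sketch);
    pthread_mutex_destroy(&attr_lock);
    return rc;
}
```

Here demo_relock() returns EDEADLK, and on glibc strerror(EDEADLK) is "Resource deadlock avoided", the same text as in the trace. The usual ways out are to release the lock before invoking user callbacks or to use a recursive lock. Which one fits OMPI's attribute code is not something this sketch can answer.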
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA