
Open MPI Development Mailing List Archives


From: Neil Ludban (nludban_at_[hidden])
Date: 2006-05-22 18:36:19


Hello,

I'm getting a core dump when using openmpi-1.0.2 with the MPI extensions
we're developing for the MATLAB interpreter. This same build of openmpi
is working great with C programs and our extensions for gnu octave. The
machine is AMD64 running Linux:

Linux kodos 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux

I believe there's a bug: opal_memory_malloc_hooks_init() links itself
into the __free_hook chain during initialization, but it never unlinks
itself at shutdown. In the interpreter environment, libopal.so is
dlclose()d and unmapped from memory long before the interpreter is done
with dynamic memory, so glibc is left calling a hook that no longer
exists. A quick check of the nightly trunk snapshot reveals some
function name changes, but no new shutdown code.

After running this trivial MPI program on a single processor:
        MPI_Init();
        MPI_Finalize();
I'm back to the MATLAB prompt, and break into the debugger:

>>> ^C
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
...
0x0000002aa0b50740 0x0000002aa0b50a28 Yes .../mexMPI_Init.mexa64
0x0000002aa0c52a50 0x0000002aa0c54318 Yes .../lib/libbcmpi.so.0
0x0000002aa0dcef90 0x0000002aa0e37398 Yes /usr/lib64/libstdc++.so.6
0x0000002aa0fa9ec0 0x0000002aa102e118 Yes .../lib/libmpi.so.0
0x0000002aa1178560 0x0000002aa11af708 Yes .../lib/liborte.so.0
0x0000002aa12cffb0 0x0000002aa12f2988 Yes .../lib/libopal.so.0
0x0000002aa1424180 0x0000002aa14249d8 Yes /lib64/libutil.so.1
0x0000002aa152a760 0x0000002aa1536368 Yes /lib64/libnsl.so.1
0x0000002aa3540b80 0x0000002aa3551077 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x0000002aa365e0a0 0x0000002aa3664a86 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x0000002aa470db50 0x0000002aa4719438 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so
0x0000002ac4e508c0 0x0000002ac4e50ed8 Yes .../mexMPI_Constants.mexa64
0x0000002ac4f52740 0x0000002ac4f52a28 Yes .../mexMPI_Finalize.mexa64

(gdb) c
>> exit

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182992729024 (LWP 21848)]
opal_mem_free_free_hook (ptr=0x7fbfff96d0, caller=0xa8d4f8) at memory_malloc_hooks.c:65

(gdb) info sharedlibrary
From To Syms Read Shared Object Library
...
0x0000002aa1424180 0x0000002aa14249d8 Yes /lib64/libutil.so.1
0x0000002aa152a760 0x0000002aa1536368 Yes /lib64/libnsl.so.1
0x0000002aa3540b80 0x0000002aa3551077 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x0000002aa365e0a0 0x0000002aa3664a86 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x0000002aa470db50 0x0000002aa4719438 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so

(gdb) list
63 static void
64 opal_mem_free_free_hook (void *ptr, const void *caller)
65 {
66 /* dispatch about the pending free */
67 opal_mem_free_release_hook(ptr, malloc_usable_size(ptr));
68
69 __free_hook = old_free_hook;
70
71 /* call the next chain down */
72 free(ptr);
73
74 /* save the hooks again and restore our hook again */

(gdb) print ptr
$2 = (void *) 0x7fbfff96d0
(gdb) print caller
$3 = (const void *) 0xa8d4f8
(gdb) print __free_hook
$4 = (void (*)(void *, const void *)) 0x2aa12f1d79 <opal_mem_free_free_hook>
(gdb) print old_free_hook
Cannot access memory at address 0x2aa1422800

Before I start blindly hacking a workaround, can somebody who's familiar
with the Open MPI internals verify that this is a real bug, suggest a
correct fix, and/or comment on other potential problems with running in
an interpreter?

Thanks-

-Neil