Thanks for persevering with this. I'm far from sure that the
information I am providing is of much use, largely because I'm pretty
confused about what's going on. Anyway...
Brian Barrett wrote:
> Can you rebuild Open MPI with debugging symbols (just setting CFLAGS
> to -g during configure should do it), rebuild, and get a full call
> stack with line numbers?
For (superfluous) thoroughness, I configured with --enable-debug and
--enable-memdebug, and set CFLAGS, FFLAGS and FCFLAGS all to -g.
gdb tells me (abbreviated):
[New Thread 2853808 (LWP 16590)]
[New Thread 18697136 (LWP 16591)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 18697136 (LWP 16591)]
0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
4371 nextsize = chunksize(nextchunk);
(gdb) bt
#0 0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
#1 0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
#2 0x08154590 in for_deallocate. ()
#3 0x08154505 in for_dealloc_allocatable ()
#4 0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40,
ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167
(gdb) bt full 5
#0 0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
p = 0x9cb4188
size = 134776
fb = (mfastbinptr *) 0xe464fd
nextchunk = 0x9cd5000
nextsize = 744
nextinuse = 15160704
prevsize = 14968205
bck = 0x11d48b4
fwd = 0x2e8
#1 0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
ar_ptr = 0xe75580
p = 0x9cb4188
hook = (void (*)(void *, const void *)) 0
#2 0x08154590 in for_deallocate. ()
No symbol table info available.
#3 0x08154505 in for_dealloc_allocatable ()
No symbol table info available.
#4 0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40,
ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167
un = 0
sig = 0.5
qn = 0
p = 1.8660254037844382
k = 0
i = 93
u = 0x11d4904
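As a sanity check on the frame #0 locals, here is a small sketch that redoes the pointer arithmetic by hand. It assumes a 32-bit glibc-style (ptmalloc) chunk layout, with a 4-byte prev_size and a 4-byte size in front of the user memory. That layout is my assumption, not something the trace states, although mem - p = 8 in the output above is consistent with it. The helper names are made up for illustration.

```python
# Assumed 32-bit ptmalloc-style chunk layout (an assumption, not from the trace):
#   chunk + 0 : prev_size (4 bytes)
#   chunk + 4 : size      (4 bytes)
#   chunk + 8 : user memory, i.e. the pointer passed to free()
SIZE_SZ = 4
HEADER = 2 * SIZE_SZ


def chunk_from_mem(mem):
    """Address of the chunk header for a user pointer."""
    return mem - HEADER


def next_chunk(chunk, size):
    """Address of the chunk that follows, given this chunk's size."""
    return chunk + size


def size_field_addr(chunk):
    """Address that a chunksize(chunk) lookup would read."""
    return chunk + SIZE_SZ


# Values copied from the gdb output above.
mem = 0x9CB4190
size = 134776

p = chunk_from_mem(mem)
nextchunk = next_chunk(p, size)
read_addr = size_field_addr(nextchunk)

print(hex(p), hex(nextchunk), hex(read_addr))
# prints: 0x9cb4188 0x9cd5000 0x9cd5004
```

Under that assumption the computed p and nextchunk match what gdb reports, and the read at line 4371 would land 4 bytes past a page boundary. The valgrind run reports its faulting address as 0x8FD3004, which has the same low 12 bits (0x004), so the two runs look consistent with each other, for whatever that is worth.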
Totalview's memory debugger tells me: "Allocator returned a block
already in use: heap may be corrupted" (at an allocation that gives
the crash when the associated storage is deallocated).
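To illustrate why corruption of this kind tends to show up at deallocation rather than at the offending write, here is a toy model. It is only a sketch: the heap is a Python bytearray, the 8-byte header layout is the same 32-bit ptmalloc-style assumption as before, and all names are hypothetical. It is not a diagnosis of what cosmomc actually does.

```python
import struct

HEADER = 8  # assumed header: 4-byte prev_size followed by 4-byte size


def make_heap(payload_sizes):
    """Lay chunks out back to back; return (heap, list of chunk offsets)."""
    heap = bytearray()
    offsets = []
    for n in payload_sizes:
        offsets.append(len(heap))
        heap += struct.pack("<II", 0, HEADER + n) + bytes(n)
    return heap, offsets


def chunk_size(heap, chunk):
    """Read the size field stored in a chunk's header."""
    return struct.unpack_from("<I", heap, chunk + 4)[0]


def next_chunk(heap, chunk):
    """Where a free()-style routine would look for the following chunk."""
    return chunk + chunk_size(heap, chunk)


heap, (a, b, c) = make_heap([16, 16, 16])
assert next_chunk(heap, b) == c  # the intact heap walks correctly

# Out-of-bounds store: six 4-byte words written into chunk a's
# four-word payload. The last two words land on chunk b's header.
payload_a = a + HEADER
for i in range(6):
    struct.pack_into("<I", heap, payload_a + 4 * i, 0x7FFFFFFF)

# Nothing failed at the time of the write. The damage is only seen
# when b's header is consulted, as it would be when b is freed.
corrupted_next = next_chunk(heap, b)
in_bounds = corrupted_next + HEADER <= len(heap)
print(hex(chunk_size(heap, b)), in_bounds)
# prints: 0x7fffffff False
```

The bad store itself goes unnoticed. The failure only appears later, when the neighbouring block's header is used to locate the next chunk and that address falls outside the heap.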
[valgrind]
> The output might be useful to us, if we could take a look (at least,
> on the OMPI build that fails). Again, doing this with a build of
> Open MPI that contains debugging symbols would greatly increase the
> usefulness to us.
I have to suppress many (irrelevant, I think...) warnings, else
valgrind stops reporting errors before it reaches the crash. The final
one is:
==10446==
==10446== Invalid read of size 4
==10446== at 0x1C02FA92: _int_free (malloc.c:4371)
==10446== by 0x1C02E6F9: free (malloc.c:3501)
==10446== by 0x815458F: for_deallocate. (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446== by 0x8154504: for_dealloc_allocatable (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446== Address 0x8FD3004 is not stack'd, malloc'd or (recently) free'd
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x8fd3004
[0] func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0 [0x1c02987a]
[1] func:[0x52bff000]
[2] func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0(free+0xa6) [0x1c02e6fa]
[3] func:./cosmomc(for_deallocate.+0x54) [0x8154590]
[4] func:./cosmomc(for_dealloc_allocatable+0x5b) [0x8154505]
[...]
*** End of error message ***
==10446==
==10446== Process terminating with default action of signal 11 (SIGSEGV)
==10446== Access not within mapped region at address 0x4
==10446== at 0x1C02FA92: _int_free (malloc.c:4371)
==10446== by 0x1C02E6F9: free (malloc.c:3501)
==10446== by 0x815458F: for_deallocate. (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446== by 0x8154504: for_dealloc_allocatable (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==