Open MPI Development Mailing List Archives

Subject: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c
From: Chris Samuel (samuel_at_[hidden])
Date: 2013-08-28 05:36:21


Hi folks,

One of our users (oh, OK, our director, one of the Dalton developers)
has found an odd behaviour of OMPI 1.6.5 on our x86 clusters and has
managed to get a small reproducer - a modified version of the
ubiquitous F90 "hello world" MPI program.

We find that if we run this program (compiled with either Intel or GCC)
after doing "ulimit -v $((1*1024*1024))" to simulate the default 1GB
memory limit for jobs under Slurm, we get odd behaviour that differs
between the two compilers.
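
To keep that limit from sticking to the login shell, the same test can be
run inside a subshell. This is only a sketch: run_limited is a made-up
helper name, and it assumes a shell whose "ulimit -v" takes its value in
KiB, as bash does.

```shell
# Hypothetical helper (not part of Open MPI or Slurm): run a command with a
# virtual-memory cap that applies only to that command.
run_limited() {
    # The parentheses start a subshell, so the limit disappears when it exits.
    (
        ulimit -v "$1" || exit 1   # cap in KiB, e.g. 1048576 for 1 GiB
        shift
        "$@"
    )
}

# Example: the 1GB cap used above, applied to the reproducer only.
# run_limited $((1*1024*1024)) ./gnumyhello_f90
```

The calling shell keeps whatever limit it had before, so repeated runs
with different caps do not interfere with each other.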

With the Intel compilers it appears to just hang, but if I run it under
strace I see it looping, constantly SEGV'ing.

With RHEL 6.4 gfortran it instead SEGV's straight away and gives a
stack trace:

 Hello, world, I am 0 of 1
[barcoo:27489] *** Process received signal ***
[barcoo:27489] Signal: Segmentation fault (11)
[barcoo:27489] Signal code: Address not mapped (1)
[barcoo:27489] Failing at address: 0x2008e5708
[barcoo:27489] [ 0] /lib64/libpthread.so.0() [0x3f7b60f500]
[barcoo:27489] [ 1] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x982) [0x7f83caff6dd2]
[barcoo:27489] [ 2] /usr/local/openmpi/1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x52) [0x7f83caff7f42]
[barcoo:27489] [ 3] ./gnumyhello_f90(MAIN__+0x146) [0x400f6a]
[barcoo:27489] [ 4] ./gnumyhello_f90(main+0x2a) [0x4011ea]
[barcoo:27489] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f7b21ecdd]
[barcoo:27489] [ 6] ./gnumyhello_f90() [0x400d69]
[barcoo:27489] *** End of error message ***

If I let it generate a core file, "bt" in gdb tells me:

(gdb) bt
#0 sYSMALLOc (av=0xffffffffffffefd0, bytes=<value optimized out>) at malloc.c:3240
#1 opal_memory_ptmalloc2_int_malloc (av=0xffffffffffffefd0, bytes=<value optimized out>) at malloc.c:4328
#2 0x00007f83caff7f42 in opal_memory_ptmalloc2_malloc (bytes=8560000000) at malloc.c:3433
#3 0x0000000000400f6a in main () at gnumyhello_f90.f90:26
#4 0x00000000004011ea in main ()
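
For scale, the request shown in frame #2 is far larger than the limit set
with ulimit earlier. A quick check of the arithmetic, assuming the shell's
"ulimit -v" counts in KiB (bash does); the variable names are just for
illustration:

```shell
limit_kib=$((1*1024*1024))          # value passed to "ulimit -v"
limit_bytes=$((limit_kib * 1024))   # 1073741824 bytes, i.e. 1 GiB
request_bytes=8560000000            # bytes= from frame #2 of the backtrace

echo "limit:   $limit_bytes bytes"
echo "request: $request_bytes bytes"
if [ "$request_bytes" -gt "$limit_bytes" ]; then
    echo "request exceeds the limit"
fi
```

The request is roughly eight times the limit, so the allocation would not
be expected to succeed under this ulimit.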

I've attached his reproducer program; I compiled it with:

mpif90 -g -o ./gnumyhello_f90 gnumyhello_f90.f90

We've reproduced it on two different Intel clusters (both RHEL 6.4,
one Nehalem and one SandyBridge), so I'd be really interested to
know whether this is a bug.

Thanks!
Chris

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci