Open MPI User's Mailing List Archives

Subject: [OMPI users] Segmentation fault on OMPI 1.6.5 built with gcc 4.4.7 and PGI pgfortran 11.10
From: Gus Correa (gus_at_[hidden])
Date: 2013-12-23 18:27:52


Dear OMPI experts

I have been using OMPI 1.6.5 built with gcc 4.4.7 and
PGI pgfortran 11.10 to successfully compile and run
a large climate modeling program (CESM) in several
different configurations.

However, today I hit a segmentation fault when running a new model
configuration.
[In the climate modeling jargon, a program is called a "model".]

This is a concern because that OMPI build
is a central piece of the production CESM setup that
all users on our two clusters currently rely on.
I have other OMPI 1.6.5 builds, with other compilers, but that one
was working very well with CESM, until today.

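Unless I am misinterpreting it, the backtrace in the error message
reproduced below suggests the fault occurred inside the OMPI library,
in the opal_memory_ptmalloc2 allocation routines reached from
pgf90_auto_alloc.
Is that the right reading?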

Other details:

Nodes are AMD Opteron 6376 x86_64, interconnect is Infiniband QDR,
OS is stock CentOS 6.4, kernel 2.6.32-358.2.1.el6.x86_64.
The program is compiled with the OMPI wrappers (mpicc and mpif90),
and somewhat conservative optimization flags:

FFLAGS := $(CPPDEFS) -i4 -gopt -Mlist -Mextend -byteswapio -Minform=inform -traceback -O2 -Mvect=nosse -Kieee

Is this a known issue?
Any clues on how to address it?

Thank you for your help,
Gus Correa

**************** error message *******************

[1,31]<stderr>:[node30:17008] *** Process received signal ***
[1,31]<stderr>:[node30:17008] Signal: Segmentation fault (11)
[1,31]<stderr>:[node30:17008] Signal code: Address not mapped (1)
[1,31]<stderr>:[node30:17008] Failing at address: 0x17
[1,31]<stderr>:[node30:17008] [ 0] /lib64/libpthread.so.0(+0xf500) [0x2b788ef9f500]
[1,31]<stderr>:[node30:17008] [ 1] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(+0x100ee3) [0x2b788e200ee3]
[1,31]<stderr>:[node30:17008] [ 2] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x111) [0x2b788e203771]
[1,31]<stderr>:[node30:17008] [ 3] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x97) [0x2b788e2046d7]
[1,31]<stderr>:[node30:17008] [ 4] /sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x8b) [0x2b788e2052ab]
[1,31]<stderr>:[node30:17008] [ 5] ./ccsm.exe(pgf90_auto_alloc+0x73) [0xe2c4c3]
[1,31]<stderr>:[node30:17008] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 31 with PID 17008 on node node30
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
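
For readers triaging a similar report, the backtrace above can be reduced
to (binary, symbol, offset) entries before the offsets are handed to a
symbolizer. The following is a minimal Python sketch and is not part of the
original message: the parse_frames name and the regular expression are
assumptions made for illustration, and the sample holds three frames copied
from the log above.

```python
import re

# Three frames copied from the error message above, with the line
# wrapping of the archived email left intact.
SAMPLE = """\
[1,31]<stderr>:[node30:17008] [ 0] /lib64/libpthread.so.0(+0xf500)
[0x2b788ef9f500]
[1,31]<stderr>:[node30:17008] [ 2]
/sw/openmpi/1.6.5/gnu-4.4.7-pgi-11.10/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x111)
[0x2b788e203771]
[1,31]<stderr>:[node30:17008] [ 5] ./ccsm.exe(pgf90_auto_alloc+0x73)
[0xe2c4c3]
"""

# One frame has the shape: [ 2] /path/lib.so(symbol+0x111) [0x2b788e203771]
# The symbol may be empty (as in frame 0) when no name could be resolved.
# Whitespace between the parts may be a space or a wrapped newline.
FRAME = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"                                # frame number
    r"(?P<binary>\S+?)"                                        # library or executable
    r"\((?P<symbol>[^()+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)\s+"  # symbol + offset
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"                         # absolute address
)


def parse_frames(text):
    """Return one dict per backtrace frame found in text (hypothetical helper)."""
    return [match.groupdict() for match in FRAME.finditer(text)]


frames = parse_frames(SAMPLE)
for frame in frames:
    name = frame["symbol"] or "<unresolved>"
    print(frame["index"], frame["binary"], name, frame["offset"])
```

Frames with an empty symbol, such as frame 0, carry only an offset; those
offsets can be passed to a tool such as addr2line against the matching
binary, provided it was built with debug information.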