Dear Ralph,

On May 27, 2014, at 6:31 PM, Ralph Castain wrote:
So out of curiosity - how was this job launched? Via mpirun or directly using srun?

The job was submitted using mpirun. However, Open MPI is compiled with SLURM support (and I am starting to believe this might not be ideal after all!). I have a partial job trace dumped by the process when it died:

mpirun noticed that process rank 8190 with PID 29319 on node sand-8-39 exited on signal 11 (Segmentation fault).

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
diag_OMPI-INTEL.x  0000000000537349  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  0000000000535C1E  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  000000000050CF52  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  00000000004F0BB3  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  00000000004BEB99  Unknown               Unknown  Unknown
                   00007FE5B5BE5710  Unknown               Unknown  Unknown
                   00007FE5A8C0A867  Unknown               Unknown  Unknown
                   00007FE5ADA36644  Unknown               Unknown  Unknown
                   00007FE5B288262A  Unknown               Unknown  Unknown
                   00007FE5AC344FAF  Unknown               Unknown  Unknown
                   00007FE5B5064E7D  Unknown               Unknown  Unknown
                   00007FE5B531919B  Unknown               Unknown  Unknown
                   00007FE5B82EC0CE  Unknown               Unknown  Unknown
                   00007FE5B82EBE36  Unknown               Unknown  Unknown
                   00007FE5B82EBDFD  Unknown               Unknown  Unknown
                   00007FE5B82EC2CD  Unknown               Unknown  Unknown
                   00007FE5B82EB798  Unknown               Unknown  Unknown
                   00007FE5B82E571A  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  00000000004101C2  MAIN__                    562  dirac_exomol_eigen.f90
diag_OMPI-INTEL.x  000000000040A1A6  Unknown               Unknown  Unknown
                   00007FE5B4A89D1D  Unknown               Unknown  Unknown
diag_OMPI-INTEL.x  000000000040A099  Unknown               Unknown  Unknown

(plus many more trace lines like these)

Unfortunately there is no more information than this, because not every library was built with debug flags. The computation is concentrated in ScaLAPACK and ELPA, both of which I recompiled myself. I ran on 8192 MPI processes, and the memory allocated per MPI process was below 1 GByte. My compute nodes have 64 GByte of RAM and two eight-core Intel Sandy Bridge processors. Since 512 nodes are 80% of the cluster I have available for this test, I cannot easily schedule a repetition of the test.
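
As a quick sanity check on these figures, here is a minimal sketch of the arithmetic. It assumes one MPI rank per core and takes 1 GByte as an upper bound on the memory per rank; the variable names are only illustrative.

```python
# Figures quoted in this message; "one rank per core" is an assumption.
nodes = 512
sockets_per_node = 2
cores_per_socket = 8
ram_per_node_gb = 64
max_mem_per_rank_gb = 1  # upper bound: allocation was "below 1 GByte"

ranks_per_node = sockets_per_node * cores_per_socket
total_ranks = nodes * ranks_per_node
max_mem_per_node_gb = ranks_per_node * max_mem_per_rank_gb

print(f"ranks per node:      {ranks_per_node}")
print(f"total ranks:         {total_ranks}")
print(f"max memory per node: {max_mem_per_node_gb} of {ram_per_node_gb} GByte")
```

Under these assumptions the rank count matches the 8192 processes of the run, and the application's own allocations should stay well below the RAM available on each node.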

I wonder whether this message, which may be related to libevent, could in principle be the cause of this segmentation fault. I am working to understand the cause on my side, but so far a reduced problem size using fewer nodes has never failed.

Any help is much appreciated!


Mr. Filippo SPIGA, M.Sc. ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
