Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Deadly warning "Epoll ADD(4) on fd 2 failed." ?
From: Filippo Spiga (spiga.filippo_at_[hidden])
Date: 2014-05-28 03:03:08


Dear Ralph,

On May 27, 2014, at 6:31 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> So out of curiosity - how was this job launched? Via mpirun or directly using srun?

The job was submitted using mpirun. However, Open MPI is compiled with SLURM support (and I am starting to believe this might not be ideal after all!). Here is a partial job trace dumped by the process when it died:

--------------------------------------------------------------------------
mpirun noticed that process rank 8190 with PID 29319 on node sand-8-39 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
diag_OMPI-INTEL.x 0000000000537349 Unknown Unknown Unknown
diag_OMPI-INTEL.x 0000000000535C1E Unknown Unknown Unknown
diag_OMPI-INTEL.x 000000000050CF52 Unknown Unknown Unknown
diag_OMPI-INTEL.x 00000000004F0BB3 Unknown Unknown Unknown
diag_OMPI-INTEL.x 00000000004BEB99 Unknown Unknown Unknown
libpthread.so.0 00007FE5B5BE5710 Unknown Unknown Unknown
libmlx4-rdmav2.so 00007FE5A8C0A867 Unknown Unknown Unknown
mca_btl_openib.so 00007FE5ADA36644 Unknown Unknown Unknown
libopen-pal.so.6 00007FE5B288262A Unknown Unknown Unknown
mca_pml_ob1.so 00007FE5AC344FAF Unknown Unknown Unknown
libmpi.so.1 00007FE5B5064E7D Unknown Unknown Unknown
libmpi_mpifh.so.2 00007FE5B531919B Unknown Unknown Unknown
libelpa.so.0 00007FE5B82EC0CE Unknown Unknown Unknown
libelpa.so.0 00007FE5B82EBE36 Unknown Unknown Unknown
libelpa.so.0 00007FE5B82EBDFD Unknown Unknown Unknown
libelpa.so.0 00007FE5B82EC2CD Unknown Unknown Unknown
libelpa.so.0 00007FE5B82EB798 Unknown Unknown Unknown
libelpa.so.0 00007FE5B82E571A Unknown Unknown Unknown
diag_OMPI-INTEL.x 00000000004101C2 MAIN__ 562 dirac_exomol_eigen.f90
diag_OMPI-INTEL.x 000000000040A1A6 Unknown Unknown Unknown
libc.so.6 00007FE5B4A89D1D Unknown Unknown Unknown
diag_OMPI-INTEL.x 000000000040A099 Unknown Unknown Unknown

(plus a lot more trace output like this)
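For completeness, the launch goes through an ordinary SLURM batch allocation with mpirun called inside it. A minimal sketch of such a script (reconstructed from the node and core counts I give below, not my actual submission script):

  #!/bin/bash
  #SBATCH --nodes=512
  #SBATCH --ntasks-per-node=16

  mpirun ./diag_OMPI-INTEL.x

With a SLURM-enabled build, mpirun reads the allocation from SLURM and starts its daemons through srun; "ompi_info | grep -i slurm" should show the slurm ras/plm components if that support really is compiled in.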

There is unfortunately no more information than this trace, because not every library was built with debug flags. The computation is all concentrated in ScaLAPACK and ELPA, which I recompiled myself. The run used 8192 MPI processes, and the memory allocated per MPI process was below 1 GByte. My compute nodes have 64 GByte of RAM and two eight-core Intel Sandy Bridge processors. Since the 512 nodes used for this test are 80% of the cluster I have available, I cannot easily reschedule a repetition of the test.
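(In other words: 8192 ranks at 16 cores per node is exactly the 512 nodes, and at under 1 GByte per rank the application footprint stays below roughly 16 GByte per node, well within the 64 GByte available.)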

I wonder whether this message, which appears to be related to libevent, could in principle cause this segmentation fault. I am working to understand the cause on my side, but so far a reduced problem size on fewer nodes has never failed.

Any help is much appreciated!

Regards,
F

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert