I am using an Open MPI v1.8.2 nightly snapshot compiled with SLURM support (version 14.03pre5). The two messages below appeared during a job with 2048 MPI processes that died after 24 hours!
[warn] Epoll ADD(1) on fd 0 failed. Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
[warn] Epoll ADD(4) on fd 2 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Operation not permitted
The first one appeared immediately at the beginning and had no apparent effect: the application started computing and successfully called a large parallel eigensolver. The second message appeared after 18 to 19 hours of non-stop computation, and the application then crashed without showing any other error message! I had been checking regularly that the MPI processes were not stuck; after this message, all the processes were aborted without writing anything to stdout/stderr. It is quite strange.
I believe these messages come from Open MPI (but correct me if I am wrong!). I am going to look at the application and the various libraries to find out if something is wrong. In the meantime, it would be a great help if anyone could clarify the exact meaning of these warning messages.
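For reference, here is a minimal sketch of one way an "Operation not permitted" error can arise when adding a file descriptor to epoll. It rests on two assumptions I have not confirmed: that the warnings come from an epoll-based event loop inside Open MPI, and that fd 0 and fd 2 pointed at something epoll cannot monitor. On Linux, epoll_ctl() fails with EPERM when the target descriptor does not support epoll, for example a regular file. The helper name try_epoll_add is hypothetical, and the example is Linux-only.

```python
import errno
import os
import select
import tempfile


def try_epoll_add(fd, events):
    """Register fd with a fresh epoll instance.

    Returns None on success, or the symbolic errno name (e.g. "EPERM")
    if the registration fails. Hypothetical helper, for illustration only.
    """
    ep = select.epoll()
    try:
        ep.register(fd, events)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        ep.close()


# A regular file does not support epoll, so the add is expected to fail.
with tempfile.TemporaryFile() as f:
    regular_file_result = try_epoll_add(f.fileno(), select.EPOLLIN)

# A pipe does support epoll, so the add is expected to succeed.
read_end, write_end = os.pipe()
try:
    pipe_result = try_epoll_add(read_end, select.EPOLLIN)
finally:
    os.close(read_end)
    os.close(write_end)

print("regular file:", regular_file_result)
print("pipe:", pipe_result)
```

If this is what happened in my job, the warnings would only say that stdin (fd 0) and stderr (fd 2) could not be watched for events, for instance because the batch system redirected them to files. That alone would not explain the crash.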
Many thanks in advance.
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert