Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trapping fortran I/O errors leaving zombie mpi processes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-01-29 07:59:55


On Jan 28, 2010, at 2:23 PM, Laurence Marks wrote:

> > If one process dies prematurely in Open MPI (i.e., before MPI_Finalize), all the others
> > should be automatically killed.
>
> This does not seem to be happening. Part of the problem may be (and I
> am out of my depth here) that the fortran rtl library (ifort) does not
> appear to be dying prematurely, at least there is no signal that I can
> catch (I'm not a good c programmer).

Ahh. That would be a problem. If the process doesn't die, then Open MPI has no way to know that it is hung, and therefore any other MPI processes that are waiting for messages (or whatever) from the hung process will eventually block waiting for input that will never come. End result: the entire MPI job hangs.

Can you double-check that this is actually what is happening, i.e., that no process is actually exiting? It would be good to confirm (and it would make me feel better that we don't have some corner case where an MPI process aborting early isn't terminating the entire job properly). If you run your MPI job and see this error occur, run "ps" on all the nodes where the job is running and count the number of MPI processes that you see.
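If counting by hand across many nodes gets tedious, something like the following rough Python sketch could do it for you. It assumes passwordless ssh to each node and a POSIX "ps"; the host names and the executable name "my_mpi_app" in the usage note are placeholders I made up, not anything from your setup.

```python
import subprocess


def count_processes(ps_output, name):
    """Count lines of `ps -e -o comm=` output whose command name equals `name`.

    Note: on Linux the comm field is truncated to 15 characters, so pass the
    truncated name if your executable name is longer than that.
    """
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def count_on_host(host, name):
    """Run ps on `host` over ssh (assumes passwordless ssh) and count matches."""
    result = subprocess.run(
        ["ssh", host, "ps", "-e", "-o", "comm="],  # comm= prints names, no header
        capture_output=True,
        text=True,
        check=True,
    )
    return count_processes(result.stdout, name)


def count_on_hosts(hosts, name):
    """Return a dict mapping each host to its number of matching processes."""
    return {host: count_on_host(host, name) for host in hosts}
```

For example, count_on_hosts(["node01", "node02"], "my_mpi_app") would return a per-node count. If every node still shows the full number of processes after the I/O error, then nothing has exited and the job is simply hung.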

> I posted to the Intel ifort site as well, and the response I got (see
> link below) is that "There is a feature request in to add this
> functionality, but it is not currently on the list for
> implementation."
>
> http://software.intel.com/en-us/forums/showthread.php?t=71571&o=d&s=lr

Bummer!

I'm tangentially involved in Fortran/MPI stuff, but I'm not enough of a Fortran expert to know how to help here. I understand that in your final production code, this problem likely won't occur. But that doesn't help while you're writing / debugging the code itself (which takes a huge amount of time and effort).

-- 
Jeff Squyres
jsquyres_at_[hidden]