Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trapping fortran I/O errors leaving zombie mpi processes
From: Laurence Marks (L-marks_at_[hidden])
Date: 2010-01-29 08:23:44


On Fri, Jan 29, 2010 at 6:59 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Jan 28, 2010, at 2:23 PM, Laurence Marks wrote:
>
>> > If one process dies prematurely in Open MPI (i.e., before MPI_Finalize), all the others should be automatically killed.
>>
>> This does not seem to be happening. Part of the problem may be (and I
>> am out of my depth here) that the Fortran runtime library (ifort) does
>> not appear to be dying prematurely; at least there is no signal that I
>> can catch (I'm not a good C programmer).
>
> Ahh.  That would be a problem.  If the process doesn't die, then Open MPI has no way to know that it is hung, and therefore any other MPI processes that are waiting for messages (or whatever) from the hung process will eventually block waiting for input that will never come.  End result: the entire MPI job hangs.
>
> Can you double-check that this is actually what is happening, i.e., that no process is actually exiting?  It would be good to confirm (and it would make me feel better that we don't have some corner case where an MPI process aborting early isn't terminating the entire job properly).  If you run your MPI job and see this error occur, run "ps" on all the nodes where the job is running and count the number of MPI processes that you see.

I'll try, but these things are sometimes hard to reproduce, and I have
to wait for free nodes to run the test. If I do manage to reproduce the
issue (I've since added ERR= traps, so I would have to revert to the
earlier code), is there anything else to look at?
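
The "ps" check suggested above (count the MPI processes on every node) could be scripted roughly as follows. This is only a sketch and not something from the thread: it assumes passwordless ssh to each node, and the node names and the executable name my_mpi_app are placeholders to be replaced.

```python
import subprocess


def count_processes(ps_output: str, name: str) -> int:
    """Count lines of `ps -e -o comm=` output whose command name equals `name`."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def count_on_node(host: str, name: str) -> int:
    """Run ps on a node over ssh (assumes passwordless ssh) and count matches."""
    result = subprocess.run(
        ["ssh", host, "ps", "-e", "-o", "comm="],  # all processes, names only, no header
        capture_output=True,
        text=True,
        check=True,
    )
    return count_processes(result.stdout, name)


if __name__ == "__main__":
    # Hypothetical node names and executable name; substitute your own.
    nodes = ["node01", "node02"]
    exe = "my_mpi_app"
    for host in nodes:
        print(host, count_on_node(host, exe))
```

If the count on every node still equals the number of ranks started there, no process has exited. Note that on Linux the comm field is truncated to 15 characters, so a longer executable name would need to be shortened to match.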

>
>> I posted to the Intel ifort site as well, and the response I got (see
>> link below) is that "There is a feature request in to add this
>> functionality, but it is not currently on the list for
>> implementation."
>>
>> http://software.intel.com/en-us/forums/showthread.php?t=71571&o=d&s=lr
>
> Bummer!
>
> I'm tangentially involved in Fortran/MPI stuff, but I'm not enough of a Fortran expert to know how to help here -- I understand that in your final production code, this problem likely won't occur.  But that doesn't help while you're writing / debugging the code itself (which is a huge amount of time and effort).
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]

-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.