As Jeff indicated, our ability to handle failures has improved over time - I'm
not sure which version your output comes from.
The type of failure also plays a major role in our ability to respond. If a
process actually segfaults or dies, we usually pick that up pretty well and
abort the rest of the job (certainly, that seems to be working pretty well
in the 1.2 series and beyond).
If an MPI communication fails, I'm not sure what the MPI layer does - I
believe it may retry for a while, but I don't know how robust the error
handling is in that layer. Perhaps someone else could address that question.
If an actual node fails, then we don't handle that very well at all, even in
today's development version. The problem is that we need to rely on the
daemon on that node to tell us that the local procs died - if the node dies,
then the daemon can't do that, so we never know it happened.
We are working on solutions to that problem. Hopefully, we will have at
least a preliminary version in the next release.
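
To make the node-failure case concrete, here is a minimal sketch of one common way to detect it: a heartbeat timeout. This is not how Open MPI does it today, and nothing here is an Open MPI API. The class name, the method names, the timeout value, and the second node name are all assumptions made up for illustration. The idea is that if the launcher stops hearing from a node's daemon for longer than some timeout, it can treat every proc on that node as dead, rather than waiting for a report that will never arrive.

```python
import time


class HeartbeatMonitor:
    """Hypothetical launcher-side tracker of when each node's daemon last reported in."""

    def __init__(self, nodes, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        now = clock()
        # Treat every node as having been seen at startup.
        self.last_seen = {node: now for node in nodes}

    def heartbeat(self, node):
        """Record that the daemon on `node` just checked in."""
        self.last_seen[node] = self.clock()

    def dead_nodes(self):
        """Return the nodes whose daemon has been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    # Fake clock so the example is deterministic; "kyla-0-2" is a made-up node name.
    fake_now = [0.0]
    monitor = HeartbeatMonitor(
        ["kyla-0-1", "kyla-0-2"], timeout_s=5.0, clock=lambda: fake_now[0]
    )
    fake_now[0] = 4.0
    monitor.heartbeat("kyla-0-1")  # only one daemon keeps reporting
    fake_now[0] = 6.0
    for node in monitor.dead_nodes():
        print(f"no heartbeat from {node}; its local procs should be treated as dead")
```

A launcher built along these lines would then abort the rest of the job once `dead_nodes()` returns anything. The tradeoff is the timeout itself: too short and a busy node gets declared dead, too long and the job hangs for that long before anyone notices.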
On 11/7/07 6:44 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
> Support for failure scenarios is something that is getting better over
> time in Open MPI.
> It looks like the version you are using didn't properly detect the
> failure, didn't cleanly exit all MPI processes afterward, or both.
> On Nov 6, 2007, at 9:01 PM, Teng Lin wrote:
>> I just realized I have a job that has been running for a long time, even
>> though some of the nodes have already died. Is there any way to ask the
>> other nodes to quit?
>> [kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with
>> [kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with
>> The FAQ does mention it is related to:
>> Connection reset by peer: These types of errors usually occur after
>> MPI_INIT has completed, and typically indicate that an MPI process has
>> died unexpectedly (e.g., due to a seg fault). The specific error
>> message indicates that a peer MPI process tried to write to the now-
>> dead MPI process and failed.