Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-07 08:44:41

Support for failure scenarios is something that is getting better over
time in Open MPI.

It looks like the version you are using either didn't properly detect
the failure, didn't cleanly terminate the remaining MPI processes
afterwards, or both.
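
For reference, the errno=104 in the quoted log lines can be decoded
mechanically. The following is only a rough sketch in Python, not anything
shipped with Open MPI: the helper name parse_btl_error and the pattern
LINE_RE are made up for illustration, and the pattern assumes the log line
has been re-joined onto a single line. Keep in mind that errno numbers are
platform-specific; on Linux, 104 is ECONNRESET ("Connection reset by peer").

```python
import errno
import os
import re

# Hypothetical pattern for lines such as:
#   [kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with errno=104
# (assumes the wrapped log line has been joined back into one line)
LINE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+(?P<func>\w+):.*?errno=(?P<errno>\d+)"
)


def parse_btl_error(line):
    """Return the fields of a BTL error line as a dict, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    code = int(m.group("errno"))
    return {
        "host": m.group("host"),
        "pid": int(m.group("pid")),
        "function": m.group("func"),
        "errno": code,
        # Symbolic name and message come from the local platform's tables,
        # so they are only meaningful on the same OS that produced the log.
        "name": errno.errorcode.get(code, "UNKNOWN"),
        "message": os.strerror(code),
    }


if __name__ == "__main__":
    sample = ("[kyla-0-1.local:09741] mca_btl_tcp_frag_send: "
              "writev failed with errno=104")
    print(parse_btl_error(sample))
```

Run on the machine that produced the log, the "name" and "message" fields
give the symbolic and human-readable form of the error, which makes it
easier to see which hosts noticed that a peer went away.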

On Nov 6, 2007, at 9:01 PM, Teng Lin wrote:

> Hi,
> I just realized that I have a job that has been running for a long
> time, even though some of its nodes have already died. Is there any
> way to tell the other nodes to quit?
> [kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with
> errno=104
> [kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with
> errno=104
> The FAQ does mention that this is related to:
> Connection reset by peer: These types of errors usually occur after
> MPI_INIT has completed, and typically indicate that an MPI process has
> died unexpectedly (e.g., due to a seg fault). The specific error
> message indicates that a peer MPI process tried to write to the now-
> dead MPI process and failed.
> Thanks,
> Teng

Jeff Squyres
Cisco Systems