Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] help me understand these error msgs
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-17 14:54:13


On Jan 17, 2013, at 2:25 AM, Jure Pečar <pegasus_at_[hidden]> wrote:

> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain <rhc_at_[hidden]> wrote:
>
>> This one means that a backend node lost its connection to mpirun. We use a TCP socket between the daemon on a node and mpirun to launch the processes and to detect if/when that node fails for some reason.
>
> Hm. And what would be the reasons for this? Too much load on node where mpirun is run?

No, the error means the connection was completely lost - i.e., the socket was closed. Do I understand correctly that the job runs for a while and then dies? So there are processes executing on the node that reports a lost connection?

Or is this happening on startup of the larger job, or during a call to MPI_Comm_spawn?
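
To illustrate the distinction above between a busy peer and a closed socket, here is a minimal Python sketch. It is not Open MPI's code: the names connection_state, mpirun_end and daemon_end are hypothetical, and a local socket pair stands in for the real TCP connection between the daemon and mpirun. The only assumption it demonstrates is standard socket behavior: a silent peer shows up as a read timeout, while a closed connection shows up as an empty read.

```python
import socket


def connection_state(sock, timeout=0.2):
    """Classify a peer connection as 'alive', 'idle', or 'closed'."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        return "idle"      # peer is silent, but the socket is still open
    except ConnectionError:
        return "closed"    # connection was reset or aborted
    # An empty read means the peer closed its end of the connection.
    return "alive" if data else "closed"


# Hypothetical stand-ins for the mpirun side and the daemon side.
mpirun_end, daemon_end = socket.socketpair()

state_while_silent = connection_state(daemon_end)  # nothing sent yet
mpirun_end.sendall(b"x")
state_after_send = connection_state(daemon_end)    # one byte available
mpirun_end.close()
state_after_close = connection_state(daemon_end)   # peer has gone away
daemon_end.close()

print(state_while_silent, state_after_send, state_after_close)
```

In this sketch a peer that merely stops sending is reported as idle; only closing the other end produces the closed state, which is the condition the error message refers to.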

>
> --
>
> Jure Pečar
> http://jure.pecar.org