Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] help me understand these error msgs
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-17 14:54:13

On Jan 17, 2013, at 2:25 AM, Jure Pečar <pegasus_at_[hidden]> wrote:

> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain <rhc_at_[hidden]> wrote:
>> This one means that a backend node lost its connection to mpirun. We use a TCP socket between the daemon on a node and mpirun to launch the processes and to detect if/when that node fails for some reason.
> Hm. And what would be the reasons for this? Too much load on the node where mpirun is run?

No, the error means the connection was completely lost - i.e., the socket was closed. Do I understand correctly that the job runs for a while and then dies? So there are processes executing on the node that reports a lost connection?

Or is this happening on startup of the larger job, or during a call to MPI_Comm_spawn?
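
To illustrate the difference between a peer that is merely slow and a socket that has actually been closed, here is a minimal Python sketch. It is not Open MPI's code: the names peer_state and demo are hypothetical, and a local socket pair stands in for the TCP connection between a node's daemon and mpirun. The point is only that a busy peer shows up as a read that times out, while a closed socket shows up as an empty read (or a reset).

```python
import socket


def peer_state(sock, timeout=0.2):
    """Classify the peer as 'alive', 'silent', or 'closed' (hypothetical helper)."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # Nothing arrived in time: the peer may be slow or heavily loaded,
        # but the connection itself is still open.
        return "silent"
    except ConnectionResetError:
        # The peer went away abruptly.
        return "closed"
    # An empty read means the peer closed its end of the connection.
    return "alive" if data else "closed"


def demo():
    # A connected local socket pair stands in for the daemon <-> mpirun link.
    a, b = socket.socketpair()
    states = [peer_state(a)]      # b has sent nothing yet
    b.sendall(b"x")
    states.append(peer_state(a))  # one byte is waiting to be read
    b.close()
    states.append(peer_state(a))  # b has closed its end
    a.close()
    return states


if __name__ == "__main__":
    print(demo())
```

In this sketch a loaded peer would only ever produce "silent"; "closed" requires the other end to have shut the socket or gone away, which is the situation the error message reports.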

> --
> Jure Pečar
> _______________________________________________
> users mailing list
> users_at_[hidden]