On Jan 17, 2013, at 2:25 AM, Jure Pečar <pegasus_at_[hidden]> wrote:
> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain <rhc_at_[hidden]> wrote:
>> This one means that a backend node lost its connection to mpirun. We use a TCP socket between the daemon on a node and mpirun to launch the processes and to detect if/when that node fails for some reason.
> Hm. And what would be the reasons for this? Too much load on node where mpirun is run?
No, the error means the connection was completely lost - i.e., the socket was closed. Do I understand correctly that the job runs for a while and then dies? So there are processes executing on the node that reports a lost connection?
Or is this happening on startup of the larger job, or during a call to MPI_Comm_spawn?
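To illustrate what "the socket was closed" looks like at the socket level, here is a minimal Python sketch. It is not Open MPI code: the names peer_closed, demo, mpirun_end and daemon_end are made up for illustration, and a local socketpair stands in for the TCP connection between mpirun and a daemon. The point is only that a closed connection shows up as an end-of-stream on the other side, which is different from a peer that is merely slow or busy.

```python
import socket


def peer_closed(sock, timeout=1.0):
    """Return True if the other end has closed the connection.

    On a stream socket, recv() returning b"" means end-of-stream,
    i.e. the peer closed its side. A timeout only means no data has
    arrived yet, so the connection is still considered open.
    """
    sock.settimeout(timeout)
    try:
        # MSG_PEEK looks at pending data without consuming it.
        data = sock.recv(1, socket.MSG_PEEK)
    except socket.timeout:
        return False  # peer is quiet (e.g. busy), not gone
    except ConnectionResetError:
        return True   # connection was torn down abruptly
    return data == b""


def demo():
    # Hypothetical stand-ins for the two ends of the mpirun <-> daemon link.
    mpirun_end, daemon_end = socket.socketpair()
    try:
        before = peer_closed(daemon_end, timeout=0.1)  # link still up
        mpirun_end.close()                             # simulate losing mpirun
        after = peer_closed(daemon_end, timeout=0.1)   # daemon sees end-of-stream
    finally:
        daemon_end.close()
        mpirun_end.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

In this sketch a heavily loaded peer would only produce the timeout branch, while a closed socket produces the end-of-stream branch, which is the case the error message refers to.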
> Jure Pečar
> users mailing list