Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] "Connection to lifeline lost" when developing a new rsh agent
From: Yann RADENAC (Yann.Radenac_at_[hidden])
Date: 2012-08-21 11:15:46


On 20/08/2012 15:56, Ralph Castain wrote:
> You might try adding "-mca plm_base_verbose 5 --debug-daemons" to
> watch the debug output from the daemons as they are launched.

There seems to be some interference here: my problem is "solved" by
enabling the --debug-daemons option together with a verbose level > 0!

This command fails (3 processes on 3 different machines):

mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached \
  -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI

This command works! The only change is adding --debug-daemons and a
verbose level > 0:

mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached \
  -mca plm_base_verbose 5 --debug-daemons \
  -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI

Anyway, this is just a workaround, and requiring users to set the
debug-daemons option is not acceptable.

So what do ssh and the debug-daemons option do that my agent
xos-createProcess does not?

> The lifeline is a socket connection between the daemons and mpirun. For some reason, the socket from your remote daemon back to mpirun is being closed, which the remote daemon interprets as "lifeline lost" and terminates itself. You could try setting the verbosity on the OOB to get the debug output from it (see "ompi_info --param oob tcp" for the settings), though it's likely to just tell you that the socket closed.
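
To make the lifeline idea above concrete, here is a minimal sketch in Python. It is not Open MPI's implementation: the function name watch_lifeline is hypothetical, and a local socketpair stands in for the TCP connection between mpirun and a remote daemon. It only shows how the daemon side can notice that the other end of the socket has been closed.

```python
import socket


def watch_lifeline(sock):
    """Block until the peer closes the connection, then report it.

    Hypothetical helper: illustrates the "lifeline" idea only.
    """
    while True:
        data = sock.recv(4096)
        if data == b"":
            # An empty read means the peer closed its end of the socket.
            return "lifeline lost"
        # Any other data is normal traffic; this sketch simply ignores it.


# A socketpair stands in for the mpirun <-> daemon TCP connection.
launcher_end, daemon_end = socket.socketpair()
launcher_end.sendall(b"hello")   # some ordinary traffic first
launcher_end.close()             # the launcher side goes away

status = watch_lifeline(daemon_end)
daemon_end.close()
print(status)
```

In this sketch the daemon side reads the pending data first and only then sees the empty read, so a closed socket is detected no matter which side or which layer closed it.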

By the way, no firewall is running on any of my machines.

Using the oob_tcp options:

mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached \
  -mca oob_tcp_debug 1 -mca oob_tcp_verbose 2 \
  -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI

On the machine running mpirun, the process keeps waiting (polling),
and its standard error output is:

[paradent-26.rennes.grid5000.fr:27762] [[1338,0],0]-[[1338,0],2] accepted: 172.16.97.26 - 172.16.97.6 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
[paradent-26.rennes.grid5000.fr:27762] [[1338,0],0]-[[1338,0],2] mca_oob_tcp_recv_handler: rejected connection from [[1338,0],2] connection state 4

On the remote machine, orted fails, and its standard error output is:

[paradent-6.rennes.grid5000.fr:10391] [[1338,0],2] routed:binomial: Connection to lifeline [[1338,0],0] lost