Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Job hangs when daemon does not report back from remote machine
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-08 11:56:07


It sounds to me like TCP communication isn't getting through for some
reason. Try the following:

mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname

You should see output reporting receipt of a daemon callback for each
daemon, then the sending of the launch command. My guess is that you
won't see all the daemons call back, which is why you hang.

This should tell you which node isn't getting a message back to
wherever mpirun is executing. You might then check that no firewall
sits between that node and the mpirun host, that there is a TCP path
back from it, etc.
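
As a rough way to test the TCP path back, here is a minimal sketch of a
bash helper. It is not part of Open MPI: the function name tcp_check is
made up for illustration, and it assumes bash with /dev/tcp support and
timeout(1) from coreutils on the node. Keep in mind that working ssh only
shows that the ssh port is reachable; the daemon callback is a separate
connection from the remote node to mpirun, and the "not using static
ports" lines in the quoted output suggest that port is not fixed.

```shell
#!/usr/bin/env bash
# tcp_check HOST PORT
# Exit 0 if a TCP connection to HOST:PORT opens within 3 seconds,
# non-zero otherwise. Uses bash's /dev/tcp redirection; the helper
# name and the 3-second limit are illustrative choices.
tcp_check() {
    local host=$1 port=$2
    # $0 and $1 inside the inner shell are HOST and PORT.
    timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$host" "$port" 2>/dev/null
}

# Example use, run on the remote node (ccn4) against the mpirun host
# (ccn3). Port 5000 is arbitrary and assumes you have started some
# temporary listener on ccn3 on that port:
#   tcp_check ccn3 5000 && echo "path ok" || echo "blocked or closed"
```

If the check fails on an unprivileged port while ssh itself works, a
firewall on one of the two machines is a likely suspect.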

I can help with additional diagnostics once we get that far.
Ralph

On Feb 7, 2009, at 2:40 PM, Kersey Black wrote:

> Hi,
> Disclaimer up front -- a newbie to openmpi working to get Gromacs
> and other modeling code running.
> I have it running fine on the local machine, but I am unable to get
> openmpi to work when trying to include a remote machine.
> Any help or pointers would be greatly appreciated.
>
> System: opensuse, 10.3.
> Openmpi: first I installed 1.2.2 as rpm from yast, and, when that
> did not seem to work, I switched to the current release of 1.3,
> compiled with default configuration options, except I did use
> --prefix to set the installation directory
> openmpi-mca-params.conf: (with 1.3) I have only added
> btl = self,tcp
> mpi_show_mca_params = enviro
> ssh: host-based authentication
>
> With both installs, I can run on multiple slots on the local
> machine, but when I try to include a remote machine, it hangs.
> Using this hostfile:
> ccn3 slots=2 max_slots=2
> ccn4 slots=2 max_slots=2
> Typical output (this is from 1.3) when I try to run two slots
> locally (ccn3) and 2 on the remote machine (ccn4):
> -----
> black_at_ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 4 hostname
> Daemon was launched on ccn3 - beginning to initialize
> Daemon [[63883,0],1] checking in as pid 20554 on host ccn3
> Daemon [[63883,0],1] not using static ports
> [ccn3:20554] [[63883,0],1] orted: up and running - waiting for commands!
> Daemon was launched on ccn4 - beginning to initialize
> Daemon [[63883,0],2] checking in as pid 7485 on host ccn4
> Daemon [[63883,0],2] not using static ports
> ----
> And here it hangs
>
> When I kill the job with ^C, I get:
> ccn3
> ccn4 - daemon did not report back when launched
>
> Everything I read in the FAQ (in particular in part 2 of the
> "Running MPI" portion) suggests that this has to do with SSH
> problems, or with PATH problems.
> SSH is configured and working for host-based authentication. It
> seems to be fine.
> I set the LD_LIBRARY_PATH to include openmpi/lib and include the
> openmpi/bin directory in PATH as part of a script that runs for all
> users (called by /bin/bashrc.local), and when things did not work, I
> included the same code in ~/.bashrc and ~/.profile. All of this
> results in it being set 3 times (from `env`) in an interactive shell,
> but it has not solved the problem.
>
> For comparison, when I run it locally on just two slots on the local
> machine, I get:
> black_at_ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 2 hostname
> Daemon was launched on ccn3 - beginning to initialize
> Daemon [[63924,0],1] checking in as pid 20608 on host ccn3
> Daemon [[63924,0],1] not using static ports
> [ccn3:20603] [[63924,0],0] orted_cmd: received add_local_procs
> [ccn3:20603] [[63924,0],0] node[0].name ccn3 daemon 0 arch ffc91200
> [ccn3:20603] [[63924,0],0] node[1].name ccn3 daemon 1 arch ffc91200
> [ccn3:20603] [[63924,0],0] node[2].name ccn4 daemon INVALID arch ffc91200
> [ccn3:20608] [[63924,0],1] orted: up and running - waiting for commands!
> [ccn3:20608] [[63924,0],1] orted_cmd: received add_local_procs
> [ccn3:20608] [[63924,0],1] node[0].name ccn3 daemon 0 arch ffc91200
> [ccn3:20608] [[63924,0],1] node[1].name ccn3 daemon 1 arch ffc91200
> [ccn3:20608] [[63924,0],1] node[2].name ccn4 daemon INVALID arch ffc91200
> ccn3
> [ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
> [ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
> ccn3
> [ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
> [ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
> [ccn3:20608] [[63924,0],1] orted_cmd: received exit
> [ccn3:20608] [[63924,0],1] orted: finalizing
>
> I can also run it locally on the remote machine with the command:
> ssh ccn4 mpirun --debug-daemons -np 2 hostname
>
> Many thanks for any ideas.
>
> Kersey