Open MPI User's Mailing List Archives

Subject: [OMPI users] Job hangs when daemon does not report back from remote machine
From: Kersey Black (kblack_at_[hidden])
Date: 2009-02-07 16:40:32


Hi,
Disclaimer up front: I am a newbie to Open MPI, working to get Gromacs
and other modeling code running.
I have it running fine on the local machine, but I am unable to get
openmpi to work when trying to include a remote machine.
Any help or pointers would be greatly appreciated.

System: openSUSE 10.3.
Openmpi: first I installed 1.2.2 as an rpm from YaST and, when that
did not seem to work, I switched to the current release, 1.3,
compiled with the default configuration options, except that I used
--prefix to set the installation directory.
openmpi-mca-params.conf: (with 1.3) I have only added
    btl = self,tcp
    mpi_show_mca_params = enviro
ssh: host-based authentication

With both installs, I can run on multiple slots on the local machine,
but when I try to include a remote machine, it hangs.
Using this hostfile:
   ccn3 slots=2 max_slots=2
   ccn4 slots=2 max_slots=2
Typical output (this is from 1.3) when I try to run two slots locally
(ccn3) and 2 on the remote machine (ccn4):
-----
black@ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 4 hostname
Daemon was launched on ccn3 - beginning to initialize
Daemon [[63883,0],1] checking in as pid 20554 on host ccn3
Daemon [[63883,0],1] not using static ports
[ccn3:20554] [[63883,0],1] orted: up and running - waiting for commands!
Daemon was launched on ccn4 - beginning to initialize
Daemon [[63883,0],2] checking in as pid 7485 on host ccn4
Daemon [[63883,0],2] not using static ports

----
And here it hangs.
When I kill the job with ^C, I get:
	ccn3
	ccn4 - daemon did not report back when launched
Everything I read in the FAQ (in particular part 2 of the "Running
MPI" portion) suggests that this is caused by either SSH problems or
PATH problems.
SSH is configured for host-based authentication and seems to be
working fine.
I add openmpi/lib to LD_LIBRARY_PATH and openmpi/bin to PATH in a
script that runs for all users (called by /bin/bashrc.local). When
that did not work, I added the same code to ~/.bashrc and ~/.profile.
As a result, each directory now appears three times (according to
`env`) in an interactive shell, but the problem remains.
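
The settings above were only checked in an interactive shell, whereas a command started through ssh runs in a non-interactive one, which may read different startup files. The following is a minimal sketch of how that environment could be inspected, not a confirmed diagnosis of the hang. The function names `count_path_hits` and `remote_env` are made up for illustration, and the host name and directory are placeholders to be replaced with the real ones.

```shell
# Count how many times directory $2 appears in the colon-separated list $1.
# Hypothetical helper: useful for spotting entries that were added repeatedly.
count_path_hits() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -cxF -- "$2"
}

# Print the PATH and LD_LIBRARY_PATH that a non-interactive ssh command
# sees on host $1, and the location of orted if it is found on that PATH.
# Hypothetical helper: nothing is run until the function is called.
remote_env() {
    ssh "$1" 'echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"; command -v orted'
}
```

For example, `remote_env ccn4` run from ccn3 would show whether the Open MPI directories are present without an interactive login, and `count_path_hits "$PATH" /path/to/openmpi/bin` (placeholder path) would report how many times that directory was added.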
For comparison, when I run it on just two slots on the local
machine, I get:
black@ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 2 hostname
Daemon was launched on ccn3 - beginning to initialize
Daemon [[63924,0],1] checking in as pid 20608 on host ccn3
Daemon [[63924,0],1] not using static ports
[ccn3:20603] [[63924,0],0] orted_cmd: received add_local_procs
[ccn3:20603] [[63924,0],0] node[0].name ccn3 daemon 0 arch ffc91200
[ccn3:20603] [[63924,0],0] node[1].name ccn3 daemon 1 arch ffc91200
[ccn3:20603] [[63924,0],0] node[2].name ccn4 daemon INVALID arch ffc91200
[ccn3:20608] [[63924,0],1] orted: up and running - waiting for commands!
[ccn3:20608] [[63924,0],1] orted_cmd: received add_local_procs
[ccn3:20608] [[63924,0],1] node[0].name ccn3 daemon 0 arch ffc91200
[ccn3:20608] [[63924,0],1] node[1].name ccn3 daemon 1 arch ffc91200
[ccn3:20608] [[63924,0],1] node[2].name ccn4 daemon INVALID arch ffc91200
ccn3
[ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
ccn3
[ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received exit
[ccn3:20608] [[63924,0],1] orted: finalizing
I can also run it locally on the remote machine with the command:
ssh ccn4 mpirun --debug-daemons -np 2 hostname
Many thanks for any ideas.
Kersey