On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:
> For now, may I point out something I noticed out of the
> DEBUG3 Output last night?
>
> I found this line:
>
>> debug1: Sending command: orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
What this means is that ssh sent the "orted ..." command to the remote side.
As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" -- it's a helper thingy that mpirun launches on the remote nodes before launching your actual application. All those parameters (from --daemonize through ...:56064") are options for orted.
All of that gorp is considered internal to Open MPI -- most people never see that stuff.
> Followed by:
>
>> debug2: channel 0: request exec confirm 1
>> debug2: fd 3 setting TCP_NODELAY
>> debug2: callback done
>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>> debug3: Wrote 272 bytes for a total of 1893
>> debug2: channel 0: rcvd adjust 2097152
>> debug2: channel_input_status_confirm: type 99 id 0
This is just more status information about the ssh connection; it doesn't really have any direct relation to Open MPI.
I don't know offhand if ssh displays the ack that a command successfully ran. If you're not convinced that it did, then log in to the other node while the command is hung and run ps to see whether the orted is actually running. I *suspect* that it is running, but that it's just hung for some reason.
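A rough sketch of that ps check, in case it helps. The helper name find_orted is made up for illustration (it is not part of Open MPI), and this assumes a ps that accepts the POSIX -e and -o options:

```shell
# Sketch: run this on the remote node while mpirun appears hung.
# find_orted is a hypothetical helper name, not part of Open MPI.
find_orted() {
    # Reads "PID COMMAND ARGS..." lines on stdin and prints the ones
    # whose command is orted (with or without a leading directory).
    awk '$2 ~ /(^|\/)orted$/'
}

# POSIX ps: -e selects all processes, -o picks the columns
# (the trailing "=" suppresses the column headers).
if command -v ps >/dev/null 2>&1; then
    ps -e -o pid= -o args= | find_orted
fi
```

If this prints a line, the orted was launched and is still alive, which points at a hang rather than a failed launch. No output suggests the command never started on that node.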
-----
Here are some suggestions for debugging:
On your new linux AMI instances (some of this may be redundant with what you did already):
- ensure that firewalling is disabled on all instances
- ensure that your .bashrc (or whatever startup file is relevant to your shell) prefixes PATH and LD_LIBRARY_PATH with your Open MPI installation. Be sure to *PREFIX* these variables to guarantee that you don't get interference from already-installed versions of Open MPI (e.g., if Open MPI is installed by default on your AMI and you weren't aware of it)
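As an illustration, the .bashrc lines might look something like the sketch below. The install location /opt/openmpi-1.4.3 is purely an assumption; substitute whatever --prefix you actually configured with:

```shell
# Hypothetical install prefix -- replace with your real Open MPI location.
OMPI_PREFIX=/opt/openmpi-1.4.3

# Put Open MPI *first* so it wins over any other installed copy.
export PATH="$OMPI_PREFIX/bin:$PATH"

# Only append the old value (and its colon) if one was already set,
# which avoids leaving a stray trailing ":" in the variable.
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

These lines should sit near the top of the file, before any check that returns early for non-interactive shells.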
- setup a simple, per-user SSH key, perhaps something like this:
A$ rm -rf $HOME/.ssh
(remove what you had before; let's just start over)
A$ ssh-keygen -t dsa
(hit enter to accept all defaults and set no passphrase)
A$ cd $HOME/.ssh
A$ cp id_dsa.pub authorized_keys
A$ chmod 644 authorized_keys
A$ ssh othernode
(login to node B)
B$ ssh-keygen -t dsa
(hit enter to accept all defaults and set no passphrase; just to create $HOME/.ssh with the right permissions, etc.)
B$ cd $HOME/.ssh
B$ scp firstnode:.ssh/id_dsa\* .
(enter your password on A -- we're overwriting all the files here)
B$ cp id_dsa.pub authorized_keys
B$ chmod 644 authorized_keys
Now you should be able to ssh from one node to the other without passwords:
A$ ssh othernode hostname
B
A$
and
B$ ssh firstnode hostname
A
B$
Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to ensure that non-interactive logins work properly. That's what Open MPI will use under the covers.
- Now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh sessions (some .bashrc files exit "early" if they detect a non-interactive session, which can skip those settings). For example:
A$ ssh othernode env | grep -i path
Ensure that the output shows the path and ld_library_path locations for Open MPI at the beginning of those variables. To go for the gold, you can try this, too:
A$ ssh othernode which ompi_info
(if all paths are set right, this should show the ompi_info of your 1.4.3 install)
A$ ssh othernode ompi_info
(should show all the info about your 1.4.3 install)
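If you'd rather script the PATH check than eyeball it, something along these lines could work. The function names and the /opt/openmpi-1.4.3 prefix are assumptions for illustration only:

```shell
# starts_with_dir VALUE DIR
# Succeeds if DIR is the first entry of the colon-separated VALUE.
starts_with_dir() {
    [ "${1%%:*}" = "$2" ]
}

# check_remote_path HOST DIR
# Asks HOST for its non-interactive PATH and checks that DIR comes first.
# Returns 2 if ssh itself fails, 1 if DIR is not first, 0 otherwise.
check_remote_path() {
    # Single quotes so that $PATH is expanded on the remote side.
    remote_path=$(ssh "$1" 'echo $PATH') || return 2
    starts_with_dir "$remote_path" "$2"
}

# Hypothetical install prefix -- replace with your real Open MPI location.
OMPI_PREFIX=/opt/openmpi-1.4.3
```

Usage would then be along the lines of: check_remote_path othernode "$OMPI_PREFIX/bin" || echo "PATH is not prefixed on othernode". The same starts_with_dir test applies to LD_LIBRARY_PATH with "$OMPI_PREFIX/lib".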
- If all the above works, then test with a simple, non-MPI application across both nodes:
A$ mpirun --host firstnode,othernode -np 2 hostname
A
B
A$
- When that works, you should be able to test with a simple MPI application (e.g., the examples/ring_c.c file in the Open MPI distribution):
A$ cd /path/to/open/mpi/source
A$ cd examples
A$ make
...
A$ scp ring_c othernode:/path/to/open/mpi/source/examples
...
A$ mpirun --host firstnode,othernode -np 4 ring_c
Make sense?
--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/