Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] This must be ssh problem, but I can't figure out what it is...
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-02-17 09:52:44


On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:

> For now, may I point out something I noticed out of the
> DEBUG3 Output last night?
>
> I found this line:
>
>> debug1: Sending command: orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"

What this means is that ssh sent the "orted ..." command to the remote side.

As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" -- it's a helper thingy that mpirun launches on the remote nodes before launching your actual application. All those parameters (from --daemonize through ...:56064") are options for orted.

All of that gorp is considered internal to Open MPI -- most people never see that stuff.

> Followed by:
>
>> debug2: channel 0: request exec confirm 1
>> debug2: fd 3 setting TCP_NODELAY
>> debug2: callback done
>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>> debug3: Wrote 272 bytes for a total of 1893
>> debug2: channel 0: rcvd adjust 2097152
>> debug2: channel_input_status_confirm: type 99 id 0

This is just more status information about the ssh connection; it doesn't really have any direct relation to Open MPI.

I don't know offhand if ssh displays the ack that a command successfully ran. If you're not convinced that it did, then log in to the other node while the command is hung and run ps to see whether the orted is actually running. I *suspect* that it is running, but that it's just hung for some reason.
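
A rough sketch of that ps check, in case it helps. The helper names filter_orted and remote_orted_pids are made up for illustration, "othernode" is the same placeholder hostname used in the steps below, and I'm assuming the daemon shows up in ps with the command name "orted":

```shell
# Keep only the PIDs of processes whose command name is exactly "orted".
# Expects "PID COMMAND" lines on stdin, as printed by: ps -e -o pid= -o comm=
filter_orted() {
    awk '$2 == "orted" { print $1 }'
}

# List the orted PIDs on a remote host (hypothetical helper; needs working ssh).
remote_orted_pids() {
    ssh "$1" ps -e -o pid= -o comm= | filter_orted
}
```

Run "remote_orted_pids othernode" from a second terminal while mpirun is hung. Empty output would suggest the orted never started; one or more PIDs would suggest it started and is stuck.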

-----

Here are some suggestions for debugging:

On your new Linux AMI instances (some of this may be redundant with what you did already):

- ensure that firewalling is disabled on all instances

- ensure that your .bashrc (or whatever startup file is relevant to your shell) prefixes PATH and LD_LIBRARY_PATH with your Open MPI installation. Be sure to *PREFIX* these variables to guarantee that you don't get interference from already-installed versions of Open MPI (e.g., if Open MPI is installed by default on your AMI and you weren't aware of it)

- set up a simple, per-user SSH key, perhaps something like this:

     A$ rm -rf $HOME/.ssh
   (remove what you had before; let's just start over)

     A$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase)

     A$ cd $HOME/.ssh
     A$ cp id_dsa.pub authorized_keys
     A$ chmod 644 authorized_keys
     A$ ssh othernode
   (login to node B)

     B$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase; just to create $HOME/.ssh with the right permissions, etc.)

     B$ cd $HOME/.ssh
     B$ scp firstnode:.ssh/id_dsa\* .
   (enter your password on A -- we're overwriting all the files here)

     B$ cp id_dsa.pub authorized_keys
     B$ chmod 644 authorized_keys

Now you should be able to ssh from one node to the other without passwords:

     A$ ssh othernode hostname
     B
     A$

and

     B$ ssh firstnode hostname
     A
     B$

Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to ensure that non-interactive logins work properly. That's what Open MPI will use under the covers.
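
If you want to script that test, here is a minimal sketch. The function name check_ssh and the SSH override variable are my own inventions for illustration; BatchMode=yes and ConnectTimeout are standard OpenSSH client options. With BatchMode=yes, ssh fails instead of prompting for a password, so a broken key setup shows up as an error rather than a hang:

```shell
# Run "hostname" on a remote node without ever prompting for a password.
# SSH can be overridden for testing; it defaults to the real ssh client.
check_ssh() {
    host="$1"
    if out=$("${SSH:-ssh}" -o BatchMode=yes -o ConnectTimeout=10 "$host" hostname 2>/dev/null); then
        echo "ok: $host reports $out"
    else
        echo "FAILED: $host"
        return 1
    fi
}
```

Run "check_ssh othernode" on A and "check_ssh firstnode" on B; both need to succeed before mpirun has a chance of working.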

- Now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh sessions (some .bashrc files exit "early" if they detect a non-interactive session). For example:

     A$ ssh othernode env | grep -i path

Ensure that the output shows your Open MPI locations at the beginning of PATH and LD_LIBRARY_PATH. To go for the gold, you can try this, too:

     A$ ssh othernode which ompi_info
     (if all paths are set right, this should show the ompi_info of your 1.4.3 install)
     A$ ssh othernode ompi_info
     (should show all the info about your 1.4.3 install)
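
If those checks show the wrong paths, something like the following near the very top of your startup file (before any test that exits early for non-interactive shells) is one way to do the prefixing. This is only a sketch: /opt/openmpi-1.4.3 is a hypothetical install prefix, and OMPI_PREFIX is just a variable name I picked, so substitute wherever your 1.4.3 install actually lives:

```shell
# Hypothetical install prefix; replace with the location of your own install.
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi-1.4.3}"

# Put the Open MPI directories FIRST so they win over any other install.
PATH="$OMPI_PREFIX/bin:$PATH"
# Append the old value only if there was one, to avoid a trailing ":".
LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH LD_LIBRARY_PATH
```

After that, "ssh othernode env | grep -i path" should show the Open MPI directories as the first entry in both variables.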

- If all the above works, then test with a simple, non-MPI application across both nodes:

     A$ mpirun --host firstnode,othernode -np 2 hostname
     A
     B
     A$

- When that works, you should be able to test with a simple MPI application (e.g., the examples/ring_c.c file in the Open MPI distribution):

     A$ cd /path/to/open/mpi/source
     A$ cd examples
     A$ make
     ...
     A$ scp ring_c othernode:/path/to/open/mpi/source/examples
     ...
     A$ mpirun --host firstnode,othernode -np 4 ring_c

Make sense?

-- 
Jeff Squyres
jsquyres_at_[hidden]