Subject: Re: [OMPI users] This must be ssh problem, but I can't figure out what it is...
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-02-17 09:52:44


On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:

> For now, may I point out something I noticed out of the
> DEBUG3 Output last night?
>
> I found this line:
>
>> debug1: Sending command: orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"

What this means is that ssh sent the "orted ..." command to the remote side.

As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" -- it's a helper thingy that mpirun launches on the remote nodes before launching your actual application. All those parameters (from --daemonize through ...:56064") are options for orted.

All of that gorp is considered internal to Open MPI -- most people never see that stuff.
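
(As an aside: if you ever want to watch that gorp yourself, mpirun's --debug-daemons option should make the orteds emit their debugging output back to your terminal, e.g.:

     A$ mpirun --debug-daemons --host firstnode,othernode -np 2 hostname

but you normally shouldn't need that.)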

> Followed by:
>
>> debug2: channel 0: request exec confirm 1
>> debug2: fd 3 setting TCP_NODELAY
>> debug2: callback done
>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>> debug3: Wrote 272 bytes for a total of 1893
>> debug2: channel 0: rcvd adjust 2097152
>> debug2: channel_input_status_confirm: type 99 id 0

This is just more status information about the ssh connection; it doesn't really have any direct relation to Open MPI.

I don't know offhand if ssh displays the ack that a command successfully ran. If you're not convinced that it did, then login to the other node while the command is hung and run a ps to see if the orted is actually running or not. I *suspect* that it is running, but that it's just hung for some reason.
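
For example, from another terminal, something like this should tell you (the exact ps flags may vary on your system):

     A$ ssh othernode ps -ef | grep orted

If an orted shows up in that output, the remote launch worked and something after it is hanging; if nothing shows up, the remote launch itself failed.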

-----

Here are some suggestions to try for debugging:

On your new Linux AMI instances (some of this may be redundant with what you did already):

- ensure that firewalling is disabled on all instances
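
  For example, on most Linux systems something like this will show and stop iptables (the exact commands and service name vary by distro, so treat this as a sketch; on EC2, also remember that security groups can block traffic between instances independently of iptables):

     A$ sudo /sbin/iptables -L
   (should show empty or accept-everything chains)
     A$ sudo /sbin/service iptables stop
   (RHEL/CentOS-style systems; other distros differ)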

- ensure that your .bashrc (or whatever startup file is relevant to your shell) prefixes PATH and LD_LIBRARY_PATH with your Open MPI installation's bin and lib directories. Make sure you *PREFIX* these variables to guarantee that you don't get interference from already-installed versions of Open MPI (e.g., if Open MPI is installed by default on your AMI and you weren't aware of it). For example:
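
  Assuming your Open MPI 1.4.3 is installed under /opt/openmpi-1.4.3 (substitute wherever you actually installed it), put something like this near the top of the .bashrc on *both* nodes:

     export PATH=/opt/openmpi-1.4.3/bin:$PATH
     export LD_LIBRARY_PATH=/opt/openmpi-1.4.3/lib:$LD_LIBRARY_PATH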

- set up a simple, per-user SSH key, perhaps something like this:

     A$ rm -rf $HOME/.ssh
   (remove what you had before; let's just start over)

     A$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase)

     A$ cd $HOME/.ssh
     A$ cp id_dsa.pub authorized_keys
     A$ chmod 644 authorized_keys
     A$ ssh othernode
   (login to node B)

     B$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase; just to create $HOME/.ssh with the right permissions, etc.)

     B$ scp @firstnode:.ssh/id_dsa\* .
   (enter your password on A -- we're overwriting all the files here)

     B$ cp id_dsa.pub authorized_keys
     B$ chmod 644 authorized_keys

Now you should be able to ssh from one node to the other without passwords:

     A$ ssh othernode hostname
     B
     A$

and

     B$ ssh firstnode hostname
     A
     B$

Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to ensure that non-interactive logins work properly. That's what Open MPI will use under the covers.

- Now ensure that PATH and LD_LIBRARY_PATH are also set for non-interactive ssh sessions (some .bashrc files exit "early" if they detect a non-interactive session; see the sketch after the ompi_info checks below). For example:

     A$ ssh othernode env | grep -i path

Ensure that the output shows your Open MPI installation's locations at the beginning of both PATH and LD_LIBRARY_PATH. To go for the gold, you can try this, too:

     A$ ssh othernode which ompi_info
     (if all paths are set right, this should show the path to the ompi_info from your 1.4.3 install)
     A$ ssh othernode ompi_info
     (should show all the info about your 1.4.3 install)
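
A common reason for paths to be missing in the non-interactive case is a guard near the top of .bashrc that bails out for non-interactive shells; a typical one looks something like this (just a sketch; the exact test varies):

     # if not running interactively, don't do anything else
     case $- in
         *i*) ;;
           *) return ;;
     esac

If your .bashrc has something like that, make sure the Open MPI PATH / LD_LIBRARY_PATH settings are *above* it.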

- If all the above works, then test with a simple, non-MPI application across both nodes:

     A$ mpirun --host firstnode,othernode -np 2 hostname
     A
     B
     A$

- When that works, you should be able to test with a simple MPI application (e.g., the examples/ring_c.c file in the Open MPI distribution):

     A$ cd /path/to/open/mpi/source
     A$ cd examples
     A$ make
     ...
     A$ scp ring_c @othernode:/path/to/open/mpi/source/examples
     ...
     A$ mpirun --host firstnode,othernode -np 4 ring_c

Make sense?

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/