Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2011-02-11 18:17:58


It is concerning if the pipe system call fails - I can't think of why that would happen. That's not usually a permissions issue, but rather a deeper indication that something is either seriously wrong on your system or you are running out of file descriptors. Are file descriptors limited on a per-process basis, perchance?
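
For reference, a quick way to check whether a per-process file descriptor limit is the culprit (a sketch assuming a Linux node; comparing the unprivileged user against root is interesting here, since mpirun works as root):

  ulimit -Sn                      # soft per-process fd limit for this shell
  ulimit -Hn                      # hard per-process fd limit
  sudo sh -c 'ulimit -Sn'         # the same limit as root sees it
  cat /proc/sys/fs/file-nr        # system-wide: allocated, free, max fds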

Sent from my PDA. No type good.

On Feb 11, 2011, at 10:08 AM, "Gus Correa" <gus_at_[hidden]> wrote:

> Hi Tena
>
> Since root can but you can't,
> is it a directory permission problem, perhaps?
> Check the execution directory permissions (on both machines,
> if this is not an NFS-mounted dir).
> I am not sure, but IIRC Open MPI also uses /tmp for
> under-the-hood stuff, so it's worth checking permissions there as well.
> Just a naive guess.
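>
> A quick way to check both, assuming the job is launched from $HOME and
> the usual /tmp location, would be something like:
>
>   ls -ld $HOME /tmp      # /tmp should be drwxrwxrwt (world-writable, sticky)
>   df -h /tmp             # make sure the filesystem isn't full
>   touch /tmp/test.$$ && rm /tmp/test.$$ && echo "/tmp is writable"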
>
> Congrats for all the progress with the cloudy MPI!
>
> Gus Correa
>
> Tena Sakai wrote:
>> Hi,
>> I have made a bit more progress. I think I can say the ssh
>> authentication problem is behind me now. I am still having a problem
>> running mpirun, but the latest discovery, which I can reproduce, is
>> that I can run mpirun as root. Here's the session log:
>> [tsakai_at_vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ ll
>> total 8
>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ ll .ssh
>> total 16
>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys
>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config
>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>> [tsakai_at_ip-10-100-243-195 ~]$
>> [tsakai_at_ip-10-100-243-195 ~]$ # I am on machine B
>> [tsakai_at_ip-10-100-243-195 ~]$ hostname
>> ip-10-100-243-195
>> [tsakai_at_ip-10-100-243-195 ~]$
>> [tsakai_at_ip-10-100-243-195 ~]$ ll
>> total 8
>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>> [tsakai_at_ip-10-100-243-195 ~]$
>> [tsakai_at_ip-10-100-243-195 ~]$
>> [tsakai_at_ip-10-100-243-195 ~]$ cat app.ac
>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>> [tsakai_at_ip-10-100-243-195 ~]$
>> [tsakai_at_ip-10-100-243-195 ~]$ # go back to machine A
>> [tsakai_at_ip-10-100-243-195 ~]$
>> [tsakai_at_ip-10-100-243-195 ~]$ exit
>> logout
>> Connection to ip-10-100-243-195.ec2.internal closed.
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ hostname
>> ip-10-195-198-31
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ # Execute mpirun
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun -app app.ac
>> --------------------------------------------------------------------------
>> mpirun was unable to launch the specified application as it encountered an
>> error:
>> Error: pipe function call failed when setting up I/O forwarding subsystem
>> Node: ip-10-195-198-31
>> while attempting to start process rank 0.
>> --------------------------------------------------------------------------
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ # try it as root
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ sudo su
>> bash-3.2#
>> bash-3.2# pwd
>> /home/tsakai
>> bash-3.2#
>> bash-3.2# ls -l /root/.ssh/config
>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>> bash-3.2#
>> bash-3.2# cat /root/.ssh/config
>> Host *
>> IdentityFile /root/.ssh/.derobee/.kagi
>> IdentitiesOnly yes
>> BatchMode yes
>> bash-3.2#
>> bash-3.2# pwd
>> /home/tsakai
>> bash-3.2#
>> bash-3.2# ls -l
>> total 8
>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>> bash-3.2#
>> bash-3.2# # now is the time for mpirun
>> bash-3.2#
>> bash-3.2# mpirun --app ./app.ac
>> 13 ip-10-100-243-195
>> 21 ip-10-100-243-195
>> 5 ip-10-195-198-31
>> 8 ip-10-195-198-31
>> bash-3.2#
>> bash-3.2# # It works (being root)!
>> bash-3.2#
>> bash-3.2# exit
>> exit
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ # try it one more time as tsakai
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun --app app.ac
>> --------------------------------------------------------------------------
>> mpirun was unable to launch the specified application as it encountered an
>> error:
>> Error: pipe function call failed when setting up I/O forwarding subsystem
>> Node: ip-10-195-198-31
>> while attempting to start process rank 0.
>> --------------------------------------------------------------------------
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ # I don't get it.
>> [tsakai_at_ip-10-195-198-31 ~]$
>> [tsakai_at_ip-10-195-198-31 ~]$ exit
>> logout
>> [tsakai_at_vixen ec2]$
>> So, why does it say "pipe function call failed when setting up
>> I/O forwarding subsystem  Node: ip-10-195-198-31"?
>> The node it is referring to is not the remote machine; it is
>> what I call machine A. I first thought maybe this is a problem
>> with the PATH variable, but I don't think so. I compared root's
>> PATH to that of tsakai's, made them identical, and retried.
>> I got the same behavior.
>> If you could enlighten me why this is happening, I would really
>> appreciate it.
>> Thank you.
>> Tena
>> On 2/10/11 4:12 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>> Hi jeff,
>>>
>>> Thanks for the firewall tip. I tried it while allowing all TCP traffic
>>> and got an interesting and perplexing result. Here's what's interesting
>>> (BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config on this run):
>>>
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>> Host key verification failed.
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>> the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ env | grep LD_LIB
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to
>>> /usr/local/lib
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ # I'd better do this on machine B as well
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>> Warning: Identity file tsakai not accessible: No such file or directory.
>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>> [tsakai_at_ip-10-195-171-159 ~]$
>>> [tsakai_at_ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>> [tsakai_at_ip-10-195-171-159 ~]$
>>> [tsakai_at_ip-10-195-171-159 ~]$ env | grep LD_LIB
>>> LD_LIBRARY_PATH=/usr/local/lib
>>> [tsakai_at_ip-10-195-171-159 ~]$
>>> [tsakai_at_ip-10-195-171-159 ~]$ # OK, now go back to machine A
>>> [tsakai_at_ip-10-195-171-159 ~]$ exit
>>> logout
>>> Connection to ip-10-195-171-159 closed.
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ hostname
>>> ip-10-203-21-132
>>> [tsakai_at_ip-10-203-21-132 ~]$ # try mpirun again
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>> Host key verification failed.
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>> the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ # I thought openmpi library was in
>>> /usr/local/lib...
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>> total 16604
>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> libfuse.so.2.8.5
>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> libmpi.so.0.0.2
>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> libxml2.so.2.7.2
>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>> [tsakai_at_ip-10-203-21-132 ~]$ # Now, I am really confused...
>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>
>>> Do you know why it's complaining about shared libraries?
>>>
>>> Thank you.
>>>
>>> Tena
>>>
>>>
>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>
>>>> Your prior mails were about ssh issues, but this one sounds like you might
>>>> have firewall issues.
>>>>
>>>> That is, the "orted" command attempts to open a TCP socket back to mpirun for
>>>> various command and control reasons. If it is blocked from doing so by a
>>>> firewall, Open MPI won't run. In general, you can either disable your
>>>> firewall or you can set up a trust relationship for TCP connections within
>>>> your cluster.
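>>>>
>>>> A minimal way to check for a host-level firewall on each node (a sketch
>>>> assuming an iptables-based setup; adjust for your distribution) is:
>>>>
>>>>   sudo iptables -L -n          # empty chains / ACCEPT policy = no local firewall
>>>>   sudo service iptables stop   # temporarily disable it for a test run
>>>>
>>>> On EC2 the instances' security group also has to allow TCP traffic between
>>>> the nodes, in addition to any host-level firewall.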
>>>>
>>>>
>>>>
>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>
>>>>> Hi Reuti,
>>>>>
>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so, and the complete
>>>>> session is captured in the attached file.
>>>>>
>>>>> What I did is very similar to what I have done before: verify
>>>>> that ssh works and then run the mpirun command. In my somewhat lengthy
>>>>> session log, there are two responses from "LogLevel DEBUG3." First
>>>>> from an scp invocation and then from mpirun invocation. They both
>>>>> say
>>>>> debug1: Authentication succeeded (publickey).
>>>>>
>>>>>> From mpirun invocation, I see a line:
>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>> After that it hung and I control-C'd out of it, which
>>>>> gave me more lines. But the lines after
>>>>> debug1: Sending command: orted bla bla bla
>>>>> don't look good to me. Though, in truth, I have no idea what they
>>>>> mean.
>>>>>
>>>>> If you could shed some light, I would appreciate it very much.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tena
>>>>>
>>>>>
>>>>> On 2/10/11 10:57 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>
>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>> No, my environment is entirely Linux. The path to my home
>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>> even though it is an NFS mount from vixen (which knows it
>>>>>>> as /home/tsakai). For historical reasons, I have chosen to
>>>>>>> give a symbolic link named /Users to vixen's /home, so that
>>>>>>> I can use a consistent path on both vixen and blitzen.
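>>>>>>>
>>>>>>> Presumably something like this on vixen:
>>>>>>>
>>>>>>>   sudo ln -s /home /Users    # hypothetical: /Users points at the real /home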
>>>>>> okay. Sometimes the protection of the home directory must be adjusted too,
>>>>>> but
>>>>>> as you can do it from the command line this shouldn't be an issue.
>>>>>>
>>>>>>
>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>> ssh-keys for each user.
>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>> but to address one Amazon cluster node from the other I
>>>>>>> use the nodes' private DNS names: domU-12-31-39-07-35-21 and
>>>>>>> domU-12-31-39-06-74-E2. Both public and private DNS names
>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>> ssh keys for authentication in all cases, i.e., from vixen to
>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>> Amazon node B back to A. (Please see my initial post. There
>>>>>>> is a session dialogue for this.) They all work without an
>>>>>>> authentication dialogue, except for a brief initial exchange:
>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>> can't be established.
>>>>>>> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>> Are you sure you want to continue connecting (yes/no)?
>>>>>>> to which I say "yes."
>>>>>>> But I am unclear about what you mean by "hostbased authentication".
>>>>>>> Doesn't that mean with a password? If so, it is not an option.
>>>>>> No. It's convenient inside a private cluster as it won't fill each user's
>>>>>> known_hosts file and you don't need to create any ssh keys. But when the
>>>>>> hostname changes every time, it might also mean new host keys. It uses
>>>>>> host keys (private and public); this way it works for all users. Just for
>>>>>> reference:
>>>>>>
>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>
>>>>>> You could look into it later.
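>>>>>>
>>>>>> Very roughly, the pieces involved are something like the following (just a
>>>>>> sketch; the howto above has the complete recipe):
>>>>>>
>>>>>>   # client side, /etc/ssh/ssh_config on every node:
>>>>>>   HostbasedAuthentication yes
>>>>>>   EnableSSHKeysign yes
>>>>>>
>>>>>>   # server side, /etc/ssh/sshd_config on every node:
>>>>>>   HostbasedAuthentication yes
>>>>>>
>>>>>>   # plus the cluster hostnames in /etc/shosts.equiv and every node's
>>>>>>   # public host key in /etc/ssh/ssh_known_hosts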
>>>>>>
>>>>>> ==
>>>>>>
>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>> ssh domU-12-31-39-06-74-E2 ls. Is this working too?
>>>>>>
>>>>>> - What about putting:
>>>>>>
>>>>>> LogLevel DEBUG3
>>>>>>
>>>>>> in your ~/.ssh/config? Maybe in verbose mode we can see what it's trying
>>>>>> to negotiate before it fails.
>>>>>>
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tena
>>>>>>>
>>>>>>>
>>>>>>> On 2/10/11 2:27 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> your local machine is Linux like, but the execution hosts are Macs? I saw
>>>>>>>> the
>>>>>>>> /Users/tsakai/... in your output.
>>>>>>>>
>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>
>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks like:
>>>>>>>>> # machine A
>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>> The Host line above is just an abbreviation or nickname. To use the
>>>>>>>> specified settings, it's necessary to connect with exactly this name.
>>>>>>>> When the settings are the same for all machines anyway, you can use:
>>>>>>>>
>>>>>>>> Host *
>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>> IdentitiesOnly yes
>>>>>>>> BatchMode yes
>>>>>>>>
>>>>>>>> instead.
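>>>>>>>>
>>>>>>>> One way to confirm that the config is picked up and that no password
>>>>>>>> prompt can sneak in (a sketch, using the hostname from your appfile):
>>>>>>>>
>>>>>>>>   chmod 600 ~/.ssh/config    # ssh refuses a config writable by others
>>>>>>>>   ssh -o BatchMode=yes domU-12-31-39-07-35-21.compute-1.internal true && echo OK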
>>>>>>>>
>>>>>>>> Is this a private cluster (or at least private interfaces)? It would also
>>>>>>>> be
>>>>>>>> an option to use hostbased authentication, which will avoid setting any
>>>>>>>> known_hosts file or passphraseless ssh-keys for each user.
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>> HostName domU-12-31-39-07-35-21
>>>>>>>>> BatchMode yes
>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>
>>>>>>>>> # machine B
>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>> HostName domU-12-31-39-06-74-E2
>>>>>>>>> BatchMode yes
>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>
>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>
>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>> [tsakai_at_domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>
>>>>>>>>> It hangs. I control-C out of it and I get:
>>>>>>>>> mpirun: killing job...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>> that caused that situation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report
>>>>>>>>> back when launched
>>>>>>>>>
>>>>>>>>> Am I making progress?
>>>>>>>>>
>>>>>>>>> Does this mean I am past authentication and something else is the
>>>>>>>>> problem?
>>>>>>>>> Does someone have an example .ssh/config file I can look at? There are
>>>>>>>>> so many keyword-argument pairs for this config file, and I would like
>>>>>>>>> to look at some very basic one that works.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Tena Sakai
>>>>>>>>> tsakai_at_[hidden]
>>>>>>>>>
>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>> [tsakai_at_vixen local]$ cat app.ac1
>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript
>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript
>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>
>>>>>>>>>> The program I run is
>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>> where x is [5..8]. The machines vixen and blitzen each run 2 of the runs.
>>>>>>>>>>
>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>> [tsakai_at_vixen local]$ cat fib.R
>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>
>>>>>>>>>> fib <- function( n ) {
>>>>>>>>>>     a <- 0
>>>>>>>>>>     b <- 1
>>>>>>>>>>     for ( i in 1:n ) {
>>>>>>>>>>         t <- b
>>>>>>>>>>         b <- a
>>>>>>>>>>         a <- a + t
>>>>>>>>>>     }
>>>>>>>>>>     a          # return the n-th Fibonacci number
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> arg <- commandArgs( TRUE )
>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>> cat( fib(arg), myHost, '\n' )
>>>>>>>>>>
>>>>>>>>>> It reads an argument from the command line and prints the Fibonacci
>>>>>>>>>> number that corresponds to that index, followed by the machine name.
>>>>>>>>>> Pretty simple stuff.
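>>>>>>>>>>
>>>>>>>>>> Run directly on one node it produces something like:
>>>>>>>>>>
>>>>>>>>>>   $ Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>   5 vixen.egcrc.org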
>>>>>>>>>>
>>>>>>>>>> Here's the run output:
>>>>>>>>>> [tsakai_at_vixen local]$ mpirun -app app.ac1
>>>>>>>>>> 5 vixen.egcrc.org
>>>>>>>>>> 8 vixen.egcrc.org
>>>>>>>>>> 13 blitzen.egcrc.org
>>>>>>>>>> 21 blitzen.egcrc.org
>>>>>>>>>>
>>>>>>>>>> Which is exactly what I expect. So far so good.
>>>>>>>>>>
>>>>>>>>>> Now I want to run the same thing on the cloud. I launch 2 instances of
>>>>>>>>>> the same virtual machine, which I get to by:
>>>>>>>>>> [tsakai_at_vixen local]$ ssh -A -i ~/.ssh/tsakai
>>>>>>>>>> machine-instance-A-public-dns
>>>>>>>>>>
>>>>>>>>>> Now I am on machine A:
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B
>>>>>>>>>> without
>>>>>>>>>> password authentication,
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A
>>>>>>>>>> without using password
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)'
>>>>>>>>>> can't
>>>>>>>>>> be established.
>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the list
>>>>>>>>>> of
>>>>>>>>>> known hosts.
>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>> logout
>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>> logout
>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>
>>>>>>>>>> As you can see, neither machine uses a password for authentication; they
>>>>>>>>>> use public/private key pairs. There is no problem (that I can see) with
>>>>>>>>>> ssh invocation from one machine to the other, because I have a copy of
>>>>>>>>>> the public key and of the private key on each instance.
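>>>>>>>>>>
>>>>>>>>>> Since mpirun launches its daemons over a non-interactive ssh, a closer
>>>>>>>>>> test of what mpirun actually does would be something like (BatchMode
>>>>>>>>>> makes ssh fail instead of prompting if key authentication isn't enough):
>>>>>>>>>>
>>>>>>>>>>   ssh -o BatchMode=yes -i ~/.ssh/tsakai domU-12-31-39-0C-C8-01 hostname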
>>>>>>>>>>
>>>>>>>>>> The app.ac1 file is identical, except for the node names:
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>
>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password:
>>>>>>>>>> Permission denied, please try again.
>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>> process
>>>>>>>>>> that caused that situation.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>
>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>
>>>>>>>>>> mpirun (or somebody else?) asks me for a password, which I don't have.
>>>>>>>>>> I end up typing control-C.
>>>>>>>>>>
>>>>>>>>>> Here's my question:
>>>>>>>>>> How can I get past authentication by mpirun when there is no password?
>>>>>>>>>>
>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> Tena Sakai
>>>>>>>>>> tsakai_at_[hidden]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users