Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Gus Correa (gus_at_[hidden])
Date: 2011-02-11 21:06:20


Hi Tena

Please see one answer inline.

Tena Sakai wrote:
> Hi Jeff,
> Hi Gus,
>
> Thanks for your replies.
>
> I have pretty much ruled out PATH issues by setting tsakai's PATH
> as identical to that of root. In that setting I reproduced the
> same result as before: root can run mpirun correctly and tsakai
> cannot.
>
> I have also checked out permission on /tmp directory. tsakai has
> no problem creating files under /tmp.
>
> I am trying to come up with a strategy to show that each and every
> program in the PATH has "world" executable permission. It is a
> stone to turn over, but I am not holding my breath.
>
>> ... you are running out of file descriptors. Are file descriptors
>> limited on a per-process basis, perchance?
>
> I have never heard there is such a restriction on Amazon EC2. There
> are folks who keep running instances for a long, long time. Whereas
> in my case, I launch 2 instances, check things out, and then turn
> the instances off. (Given that the state of California has a huge
> debt, our funding is very tight.) So, I really doubt that's the
> case. I have run mpirun unsuccessfully as user tsakai and immediately
> after successfully as root. Still, I would be happy if you can tell
> me a way to tell the number of file descriptors used or remaining.
>
> Your mentioning file descriptors made me think of something under
> /dev. But I don't know exactly what I am fishing for. Do you have
> some suggestions?
>

1) If the environment has anything to do with Linux,
check:

cat /proc/sys/fs/file-nr /proc/sys/fs/file-max

or

sysctl -a |grep fs.file-max

This max can be set (fs.file-max=whatever_is_reasonable)
in /etc/sysctl.conf

See 'man sysctl' and 'man sysctl.conf'
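
For example, to check and, if needed, raise the system-wide limit,
something along these lines should work on most Linux systems
(the value 200000 below is only an illustration - pick whatever is
reasonable for your machines):

   # three numbers: allocated handles, allocated-but-unused, maximum
   cat /proc/sys/fs/file-nr
   # raise the maximum for the running kernel
   sysctl -w fs.file-max=200000
   # make it persistent across reboots
   echo 'fs.file-max = 200000' >> /etc/sysctl.conf
   sysctl -p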

2) Another possible source of limits.

Check "ulimit -a" (bash) or "limit" (tcsh).

If you need to change them, look at:

/etc/security/limits.conf

(See also 'man limits.conf')
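
For instance, to raise the open-files limit just for one user you
could add something like this to /etc/security/limits.conf (the
numbers are only an example), then log in again and check with
'ulimit -n' and 'ulimit -Hn':

   # <domain>   <type>  <item>   <value>
   tsakai       soft    nofile   4096
   tsakai       hard    nofile   8192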

**

Since "root can but Tena cannot",
I would check 2) first,
as they are the 'per user/per group' limits,
whereas 1) is kernel/system-wide.
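
A quick way to compare the two accounts, for instance, run as root
on machine A (assuming you can su to tsakai there):

   ulimit -n                      # root's soft limit on open files
   su - tsakai -c 'ulimit -n'     # the soft limit tsakai's shell gets
   su - tsakai -c 'ulimit -Hn'    # tsakai's hard limit

If tsakai's numbers come out much smaller than root's, that would
point straight at 2).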

I hope this helps,
Gus Correa

PS - I know you are a wise and careful programmer,
but here we had cases of programs that would
fail because of too many files that were open and never closed,
eventually exceeding the max available/permissible.
So, it does happen.
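
PPS - About a way to count the descriptors in use: on Linux you can
peek at /proc, for example (replace <pid> with the process you care
about):

   ls /proc/$$/fd | wc -l       # descriptors open in the current shell
   ls /proc/<pid>/fd | wc -l    # descriptors open in another process
   cat /proc/sys/fs/file-nr     # system-wide: allocated, unused, max
   ulimit -n                    # per-process limit for the current user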

> I wish I could reproduce this (weird) behavior on a different
> set of machines. I certainly cannot in my local environment. Sigh!
>
> Regards,
>
> Tena
>
>
> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>
>> It is concerning if the pipe system call fails - I can't think of why that
>> would happen. That's not usually a permissions issue but rather a deeper
>> indication that something is either seriously wrong on your system or you are
>> running out of file descriptors. Are file descriptors limited on a per-process
>> basis, perchance?
>>
>> Sent from my PDA. No type good.
>>
>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <gus_at_[hidden]> wrote:
>>
>>> Hi Tena
>>>
>>> Since root can but you can't,
>>> is it a directory permission problem perhaps?
>>> Check the execution directory permission (on both machines,
>>> if this is not NFS mounted dir).
>>> I am not sure, but IIRR OpenMPI also uses /tmp for
>>> under-the-hood stuff, worth checking permissions there also.
>>> Just a naive guess.
>>>
>>> Congrats for all the progress with the cloudy MPI!
>>>
>>> Gus Correa
>>>
>>> Tena Sakai wrote:
>>>> Hi,
>>>> I have made a bit more progress. I think I can say the ssh
>>>> authentication problem is behind me now. I am still having a problem running
>>>> mpirun, but the latest discovery, which I can reproduce, is that
>>>> I can run mpirun as root. Here's the session log:
>>>> [tsakai_at_vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll
>>>> total 8
>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll .ssh
>>>> total 16
>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys
>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config
>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>> [tsakai_at_ip-10-100-243-195 ~]$ # I am on machine B
>>>> [tsakai_at_ip-10-100-243-195 ~]$ hostname
>>>> ip-10-100-243-195
>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>> [tsakai_at_ip-10-100-243-195 ~]$ ll
>>>> total 8
>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>> [tsakai_at_ip-10-100-243-195 ~]$ cat app.ac
>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>> [tsakai_at_ip-10-100-243-195 ~]$ # go back to machine A
>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>> [tsakai_at_ip-10-100-243-195 ~]$ exit
>>>> logout
>>>> Connection to ip-10-100-243-195.ec2.internal closed.
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ hostname
>>>> ip-10-195-198-31
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ # Execute mpirun
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to launch the specified application as it encountered an
>>>> error:
>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>> Node: ip-10-195-198-31
>>>> while attempting to start process rank 0.
>>>> --------------------------------------------------------------------------
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it as root
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ sudo su
>>>> bash-3.2#
>>>> bash-3.2# pwd
>>>> /home/tsakai
>>>> bash-3.2#
>>>> bash-3.2# ls -l /root/.ssh/config
>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>>> bash-3.2#
>>>> bash-3.2# cat /root/.ssh/config
>>>> Host *
>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> bash-3.2#
>>>> bash-3.2# pwd
>>>> /home/tsakai
>>>> bash-3.2#
>>>> bash-3.2# ls -l
>>>> total 8
>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>> bash-3.2#
>>>> bash-3.2# # now is the time for mpirun
>>>> bash-3.2#
>>>> bash-3.2# mpirun --app ./app.ac
>>>> 13 ip-10-100-243-195
>>>> 21 ip-10-100-243-195
>>>> 5 ip-10-195-198-31
>>>> 8 ip-10-195-198-31
>>>> bash-3.2#
>>>> bash-3.2# # It works (being root)!
>>>> bash-3.2#
>>>> bash-3.2# exit
>>>> exit
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun --app app.ac
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to launch the specified application as it encountered an
>>>> error:
>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>> Node: ip-10-195-198-31
>>>> while attempting to start process rank 0.
>>>> --------------------------------------------------------------------------
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ # I don't get it.
>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>> [tsakai_at_ip-10-195-198-31 ~]$ exit
>>>> logout
>>>> [tsakai_at_vixen ec2]$
>>>> So, why does it say "pipe function call failed when setting up
>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ?
>>>> The node it is referring to is not the remote machine. It is
>>>> what I call machine A. I first thought maybe this is a problem
>>>> with the PATH variable. But I don't think so. I compared root's
>>>> PATH to that of tsakai's and made them identical and retried.
>>>> I got the same behavior.
>>>> If you could enlighten me why this is happening, I would really
>>>> appreciate it.
>>>> Thank you.
>>>> Tena
>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>> Hi Jeff,
>>>>>
>>>>> Thanks for the firewall tip. I tried it while allowing all tcp traffic
>>>>> and got an interesting and perplexing result. Here's what's interesting
>>>>> (BTW, I got rid of "LogLevel DEBUG3" from my .ssh/config on this run):
>>>>>
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>> Host key verification failed.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>>>> to launch so we are aborting.
>>>>>
>>>>> There may be more information reported by the environment (see above).
>>>>>
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>>
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to
>>>>> /usr/local/lib
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I better to this on machine B as well
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>>> Warning: Identity file tsakai not accessible: No such file or directory.
>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>> [tsakai_at_ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>> [tsakai_at_ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>>> LD_LIBRARY_PATH=/usr/local/lib
>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>> [tsakai_at_ip-10-195-171-159 ~]$ # OK, now go bak to machine A
>>>>> [tsakai_at_ip-10-195-171-159 ~]$ exit
>>>>> logout
>>>>> Connection to ip-10-195-171-159 closed.
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ hostname
>>>>> ip-10-203-21-132
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # try mpirun again
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>> Host key verification failed.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>>>> to launch so we are aborting.
>>>>>
>>>>> There may be more information reported by the environment (see above).
>>>>>
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>>
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I thought openmpi library was in
>>>>> /usr/local/lib...
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>>> total 16604
>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> libfuse.so.2.8.5
>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> libmpi.so.0.0.2
>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> libxml2.so.2.7.2
>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>
>>>>> Do you know why it's complaining about shared libraries?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Tena
>>>>>
>>>>>
>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>
>>>>>> Your prior mails were about ssh issues, but this one sounds like you might
>>>>>> have firewall issues.
>>>>>>
>>>>>> That is, the "orted" command attempts to open a TCP socket back to mpirun
>>>>>> for various command and control reasons. If it is blocked from doing so by a
>>>>>> firewall, Open MPI won't run. In general, you can either disable your
>>>>>> firewall or you can set up a trust relationship for TCP connections within
>>>>>> your cluster.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>>
>>>>>>> Hi Reuti,
>>>>>>>
>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and the complete
>>>>>>> session is captured in the attached file.
>>>>>>>
>>>>>>> What I did is very similar to what I have done before: verify
>>>>>>> that ssh works and then run the mpirun command. In my somewhat lengthy
>>>>>>> session log, there are two responses from "LogLevel DEBUG3." First
>>>>>>> from an scp invocation and then from mpirun invocation. They both
>>>>>>> say
>>>>>>> debug1: Authentication succeeded (publickey).
>>>>>>>
>>>>>>> From the mpirun invocation, I see a line:
>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>>> After that it hung and I control-C'd out of it, which
>>>>>>> gave me more lines. But the lines after
>>>>>>> debug1: Sending command: orted bla bla bla
>>>>>>> don't look good to me. But, in truth, I have no idea what they
>>>>>>> mean.
>>>>>>>
>>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tena
>>>>>>>
>>>>>>>
>>>>>>> On 2/10/11 10:57 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>>
>>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>> No, my environment is entirely Linux. The path to my home
>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>>> even though it is an nfs mount from vixen (which is known to
>>>>>>>>> itself as /home/tsakai). For historical reasons, I have
>>>>>>>>> chosen to give a symbolic link named /Users to vixen's /home,
>>>>>>>>> so that I can use a consistent path for both vixen and blitzen.
>>>>>>>> okay. Sometimes the protection of the home directory must be adjusted
>>>>>>>> too,
>>>>>>>> but
>>>>>>>> as you can do it from the command line this shouldn't be an issue.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>>> ssh-keys for each user.
>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>>> but to address from one Amazon cluster node to the other I
>>>>>>>>> use the nodes' private dns names: domU-12-31-39-07-35-21 and
>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private dns names
>>>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>>> Amazon node B back to A. (Please see my initial post. There
>>>>>>>>> is a session dialogue for this.) They all work without an
>>>>>>>>> authentication dialogue, except a brief initial dialogue:
>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>>> can't be established.
>>>>>>>>> RSA key fingerprint is
>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>> Are you sure you want to continue connecting (yes/no)?
>>>>>>>>> to which I say "yes."
>>>>>>>>> But I am unclear on what you mean by "hostbased authentication".
>>>>>>>>> Doesn't that mean with a password? If so, it is not an option.
>>>>>>>> No. It's convenient inside a private cluster as it won't fill each
>>>>>>>> user's known_hosts file and you don't need to create any ssh-keys. But when
>>>>>>>> the hostname changes every time it might also create new hostkeys. It uses
>>>>>>>> hostkeys (private and public); this way it works for all users. Just for
>>>>>>>> reference:
>>>>>>>>
>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>>
>>>>>>>> You could look into it later.
>>>>>>>>
>>>>>>>> ==
>>>>>>>>
>>>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>>>> ssh domU-12-31-39-06-74-E2 ls. Is this working too?
>>>>>>>>
>>>>>>>> - What about putting:
>>>>>>>>
>>>>>>>> LogLevel DEBUG3
>>>>>>>>
>>>>>>>> in your ~/.ssh/config. Maybe we can see what it's trying to negotiate
>>>>>>>> before
>>>>>>>> it fails in verbose mode.
>>>>>>>>
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Tena
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> your local machine is Linux-like, but the execution hosts are Macs? I
>>>>>>>>>> saw the /Users/tsakai/... in your output.
>>>>>>>>>>
>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>>
>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks
>>>>>>>>>>> like:
>>>>>>>>>>> # machine A
>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>>> This is just an abbreviation or nickname above. To use the specified
>>>>>>>>>> settings, it's necessary to specify exactly this name. When the settings
>>>>>>>>>> are the same anyway for all machines, you can use:
>>>>>>>>>>
>>>>>>>>>> Host *
>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>> BatchMode yes
>>>>>>>>>>
>>>>>>>>>> instead.
>>>>>>>>>>
>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It would
>>>>>>>>>> also be an option to use hostbased authentication, which will avoid
>>>>>>>>>> setting any known_hosts file or passphraseless ssh-keys for each user.
>>>>>>>>>>
>>>>>>>>>> -- Reuti
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> HostName domU-12-31-39-07-35-21
>>>>>>>>>>> BatchMode yes
>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>
>>>>>>>>>>> # machine B
>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>> HostName domU-12-31-39-06-74-E2
>>>>>>>>>>> BatchMode yes
>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>
>>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>>
>>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>> [tsakai_at_domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>>
>>>>>>>>>>> It hangs. I control-C out of it and I get:
>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report
>>>>>>>>>>> back when launched
>>>>>>>>>>>
>>>>>>>>>>> Am I making progress?
>>>>>>>>>>>
>>>>>>>>>>> Does this mean I am past authentication and something else is the
>>>>>>>>>>> problem?
>>>>>>>>>>> Does someone have an example .ssh/config file I can look at? There are
>>>>>>>>>>> so many keyword-argument pairs for this config file and I would like to
>>>>>>>>>>> look at some very basic one that works.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Tena Sakai
>>>>>>>>>>> tsakai_at_[hidden]
>>>>>>>>>>>
>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>> [tsakai_at_vixen local]$ cat app.ac1
>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> The program I run is
>>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>>> where x is [5..8]. The machines vixen and blitzen each run 2 runs.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>>> [tsakai_at_vixen local]$ cat fib.R
>>>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>>
>>>>>>>>>>>> fib <- function( n ) {
>>>>>>>>>>>> a <- 0
>>>>>>>>>>>> b <- 1
>>>>>>>>>>>> for ( i in 1:n ) {
>>>>>>>>>>>> t <- b
>>>>>>>>>>>> b <- a
>>>>>>>>>>>> a <- a + t
>>>>>>>>>>>> }
>>>>>>>>>>>> a
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> arg <- commandArgs( TRUE )
>>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>> cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>>
>>>>>>>>>>>> It reads an argument from the command line and produces a Fibonacci
>>>>>>>>>>>> number that corresponds to that index, followed by the machine name.
>>>>>>>>>>>> Pretty simple stuff.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's the run output:
>>>>>>>>>>>> [tsakai_at_vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>> 5 vixen.egcrc.org
>>>>>>>>>>>> 8 vixen.egcrc.org
>>>>>>>>>>>> 13 blitzen.egcrc.org
>>>>>>>>>>>> 21 blitzen.egcrc.org
>>>>>>>>>>>>
>>>>>>>>>>>> Which is exactly what I expect. So far so good.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I want to run the same thing on the cloud. I launch 2 instances
>>>>>>>>>>>> of the same virtual machine, which I get to by:
>>>>>>>>>>>> [tsakai_at_vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B
>>>>>>>>>>>> without
>>>>>>>>>>>> password authentication,
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine
>>>>>>>>>>>> A
>>>>>>>>>>>> without using password
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)'
>>>>>>>>>>>> can't
>>>>>>>>>>>> be established.
>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the
>>>>>>>>>>>> list
>>>>>>>>>>>> of
>>>>>>>>>>>> known hosts.
>>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>> logout
>>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>> logout
>>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>
>>>>>>>>>>>> As you can see, neither machine uses a password for authentication; it
>>>>>>>>>>>> uses public/private key pairs. There is no problem (that I can see) for
>>>>>>>>>>>> ssh invocation from one machine to the other. This is so because I have
>>>>>>>>>>>> a copy of the public key and a copy of the private key on each instance.
>>>>>>>>>>>>
>>>>>>>>>>>> The app.ac file is identical, except for the node names:
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>>
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>> Permission denied, please try again.
>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>>
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>
>>>>>>>>>>>> Mpirun (or somebody else?) asks me for a password, which I don't have.
>>>>>>>>>>>> I end up typing control-C.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's my question:
>>>>>>>>>>>> How can I get past authentication by mpirun where there is no
>>>>>>>>>>>> password?
>>>>>>>>>>>>
>>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>> tsakai_at_[hidden]