Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Gus Correa (gus_at_[hidden])
Date: 2011-02-14 16:47:54


Hi Tena

Answers inline.
This is getting big!

Tena Sakai wrote:
> Hi Gus,
>
> Thank you for your reply, comments, and suggestions.
>
> EC2 does have support, but it comes at an extra charge and I am
> discouraged from using it for budgetary reasons. Also, I have
> heard that their support leans toward virtualization
> and Amazon-environment-specific issues. I may have to override
> all these and ask them for help...
>

Ask them at least for super-saving free shipping.
I do this all the time.
The downside is that delivery takes 4-9 business days at least ... :)

> (Incidentally, I really like openmpi mailing list. The
> atmosphere you people generate and sustain is quite wonderful

+1 4 that!

The best friendly+knowledgeable+helpful combination amongst
all 10+ mailing lists I subscribe to.
Kudos to Jeff, Ralph, and the other developers for keeping it this way.

> and I hope one day
> I can be a contributing member.)
>

How about your bringing the world of MPI and cloud computing
into the list with your ongoing postings?
I think it is a sound contribution.
When you get it to work, maybe you can write up a little 'HowTo'. :)

> As to your suggestions:
> 1) This is a good idea. I will do hostname via mpirun. Increasing
> complexity from the simplest will probably reveal something I
> don't know.
> 2) I have run R serially on EC2 with this very ami. I have not seen
> any problem and many others have done the same.
>
> Also, here is an idea I came up with in my sleep that I want to check
> out.

I hope Amazon EC2 is not making nightmares out of your dreams.

> The ami I have been using is a centos 5.5, which I have built
> from ground up. EC2 has something called Amazon Linux ami. I
> don't know what distribution that is and I am sure it doesn't have
> R, nor openmpi. But I thought I would load these components I
> need to the Amazon Linux (again, as you suggest, by starting with the
> simplest case) and see if I can reproduce the behavior I have
> been experiencing on a different (and Amazon "official") ami.
>

That would be a possible way to go, but given my EC2-blindness,
I can only wonder whether it has a chance of working.
I have yet to understand whether you copy your compiled tools
(OpenMPI, R, etc) from your local machines to EC2,
or if you build/compile them directly on the EC2 environment.
Also, it's not clear to me if the OS in EC2 is an image
from your local machines' OS/Linux distro, or independent of them,
or if you can choose to have it either way.

On another posting, Ashley Pittman reported
using OpenMPI on Amazon EC2 without problems,
suggested a pathway, and gave several tips for it.
That is probably a more promising path,
which you may want to try.

> I will report as I discover interesting/relevant findings.
>
> Regards,
>
> Tena
>

Best,
Gus Correa

>
> On 2/12/11 9:42 PM, "Gustavo Correa" <gus_at_[hidden]> wrote:
>
>> Hi Tena
>>
>> Thank you for taking the time to explain the details of
>> the EC2 procedure.
>>
>> I am afraid everything in my bag of tricks was used.
>> As Ralph and Jeff suggested, this seems to be a very specific
>> problem with EC2.
>>
>> The difference in behavior when you run as root vs. when you
>> run as Tena suggests that EC2 places some restriction on regular users
>> that isn't present on common machines (Linux or other), I guess.
>> This may be yet another 'stone to turn', as you like to say.
>> It also suggests that there is nothing wrong in principle with your
>> openMPI setup or with your program, otherwise root would not be able to run
>> it.
>>
>> Besides Ralph's suggestion of trying the EC2 mailing list archive,
>> I wonder if EC2 has any type of user support where you could ask
>> for help.
>> After all, it is a paid service, isn't it?
>> (OpenMPI is not paid and has a great customer service, doesn't it? :) )
>> You have a well documented case to present,
>> and the very peculiar fact that the program fails for normal users but runs
>> for root.
>> This should help the EC2 support to start looking for a solution.
>>
>> I am running out of suggestions of what you could try on your own.
>> But let me try:
>>
>> 1) You may try to reduce the problem to its lowest common denominator,
>> perhaps by trying to run non-R based MPI programs on EC2, maybe the hello_c.c,
>> ring_c.c, and connectivity_c.c programs in the OpenMPI examples directory.
>> This would be to avoid the extra layer of complexity introduced by R.
>> Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
>> I.e. go in a progression of increasing complexity, see where you hit the wall.
>> This may shed some light on what is going on.
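
The progression of increasing complexity described above could be scripted so that it stops at the first step that fails. This is only a minimal sketch: `mpiexec -np 2 hostname` comes from the message itself, while the `run_ladder` helper, the `./hello_c` and `./ring_c` paths (assumed to be built from the OpenMPI examples directory), and the 120-second timeout are assumptions made for illustration.

```python
import subprocess
import sys


def run_ladder(steps, timeout=120):
    """Run commands in order of increasing complexity.

    Returns the index of the first failing step, or None if all succeed.
    """
    for index, command in enumerate(steps):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
            ok = result.returncode == 0
            detail = result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Command not found, not executable, or it hung past the timeout.
            ok = False
            detail = str(exc)
        status = "ok" if ok else "FAILED"
        print(f"step {index}: {' '.join(command)} -> {status}")
        if not ok:
            if detail:
                print(detail, file=sys.stderr)
            return index
    return None


if __name__ == "__main__":
    # Hypothetical ladder: adjust the example paths to wherever the
    # compiled example programs actually live.
    ladder = [
        ["hostname"],
        ["mpiexec", "-np", "2", "hostname"],
        ["mpiexec", "-np", "2", "./hello_c"],
        ["mpiexec", "-np", "2", "./ring_c"],
    ]
    run_ladder(ladder)
```

Running it once as the regular user and once as root would show at which step, if any, the two accounts start to behave differently.
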
>>
>> I don't know if this suggestion may really help, though.
>> It is not clear to me where the thing fails, whether it is during program
>> execution,
>> or while mpiexec is setting up the environment for the program to run.
>> If it is very early in the process, before the program starts, my suggestion
>> won't work.
>> Jeff and Ralph, who know OpenMPI inside out, may have better advice in this
>> regard.
>>
>> 2) Another thing would be to try to run R on EC2 in serial mode, without
>> mpiexec,
>> interactively or via script, to see which one EC2 doesn't like: R or OpenMPI (but
>> maybe it's both).
>>
>> Gus Correa
>>
>> On Feb 11, 2011, at 9:54 PM, Tena Sakai wrote:
>>
>>> Hi Gus,
>>>
>>> Thank you for your tips.
>>>
>>> I didn't find any smoking gun or anything that comes close.
>>> Here's the upshot:
>>>
>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 61504
>>> max locked memory (kbytes, -l) 32
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 1024
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) 8192
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) 61504
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>> bash-3.2#
>>> bash-3.2# ulimit -a
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 61504
>>> max locked memory (kbytes, -l) 32
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 1024
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) 8192
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) unlimited
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>> bash-3.2#
>>> bash-3.2#
>>> bash-3.2# ulimit -a > root_ulimit-a
>>> bash-3.2# exit
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>>> 14c14
>>> < max user processes (-u) unlimited
>>> ---
>>>> max user processes (-u) 61504
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr
>>> /proc/sys/fs/file-max
>>> 480 0 762674
>>> 762674
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>> bash-3.2#
>>> bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>> 512 0 762674
>>> 762674
>>> bash-3.2# exit
>>> exit
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>>> -bash: sysctl: command not found
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ /sbin/!!
>>> /sbin/sysctl -a |grep fs.file-max
>>> error: permission denied on key 'kernel.cad_pid'
>>> error: permission denied on key 'kernel.cap-bound'
>>> fs.file-max = 762674
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
>>> fs.file-max = 762674
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>
>>> I see a bit of difference between root and tsakai, but I cannot
>>> believe such a small difference results in the rather catastrophic
>>> failure I have reported. Would you agree with me?
>>>
>>> Regards,
>>>
>>> Tena
>>>
>>> On 2/11/11 6:06 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>
>>>> Hi Tena
>>>>
>>>> Please read one answer inline.
>>>>
>>>> Tena Sakai wrote:
>>>>> Hi Jeff,
>>>>> Hi Gus,
>>>>>
>>>>> Thanks for your replies.
>>>>>
>>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>>>>> as identical to that of root. In that setting I reproduced the
>>>>> same result as before: root can run mpirun correctly and tsakai
>>>>> cannot.
>>>>>
>>>>> I have also checked out permission on /tmp directory. tsakai has
>>>>> no problem creating files under /tmp.
>>>>>
>>>>> I am trying to come up with a strategy to show that each and every
>>>>> program in the PATH has "world" executable permission. It is a
>>>>> stone to turn over, but I am not holding my breath.
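
One possible strategy for the "world executable" check is to walk every directory on the PATH and list anything that lacks the other-execute bit. The sketch below is an illustration only; the function name `not_world_executable` is made up here, and it looks only at permission bits (it does not consider ACLs, SELinux, or mount options such as noexec).

```python
import os
import stat


def not_world_executable(path_value):
    """Return PATH directories and regular files that lack the o+x bit."""
    offenders = []
    for directory in path_value.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        # A directory without o+x cannot be searched by other users.
        if not os.stat(directory).st_mode & stat.S_IXOTH:
            offenders.append(directory)
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # directory not readable by the current user
        for name in names:
            full = os.path.join(directory, name)
            try:
                mode = os.stat(full).st_mode  # follows symlinks
            except OSError:
                continue  # dangling symlink or unreadable entry
            if stat.S_ISREG(mode) and not mode & stat.S_IXOTH:
                offenders.append(full)
    return offenders


if __name__ == "__main__":
    for entry in not_world_executable(os.environ.get("PATH", "")):
        print(entry)
```

An empty result means every regular file on the PATH carries the other-execute bit, which would rule this particular stone out.
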
>>>>>
>>>>>> ... you are running out of file descriptors. Are file descriptors
>>>>>> limited on a per-process basis, perchance?
>>>>> I have never heard there is such restriction on Amazon EC2. There
>>>>> are folks who keep running instances for a long, long time. Whereas
>>>>> in my case, I launch 2 instances, check things out, and then turn
>>>>> the instances off. (Given that the state of California has huge
>>>>> debts, our funding is very tight.) So, I really doubt that's the
>>>>> case. I have run mpirun unsuccessfully as user tsakai and immediately
>>>>> after successfully as root. Still, I would be happy if you can tell
>>>>> me a way to tell the number of file descriptors used or remaining.
>>>>>
>>>>> Your mention of file descriptors made me think of something under
>>>>> /dev. But I don't know exactly what I am fishing for. Do you have
>>>>> some suggestions?
>>>>>
>>>> 1) If the environment has anything to do with Linux,
>>>> check:
>>>>
>>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>>
>>>>
>>>> or
>>>>
>>>> sysctl -a |grep fs.file-max
>>>>
>>>> This max can be set (fs.file-max=whatever_is_reasonable)
>>>> in /etc/sysctl.conf
>>>>
>>>> See 'man sysctl' and 'man sysctl.conf'
>>>>
>>>> 2) Another possible source of limits.
>>>>
>>>> Check "ulimit -a" (bash) or "limit" (tcsh).
>>>>
>>>> If you need to change look at:
>>>>
>>>> /etc/security/limits.conf
>>>>
>>>> (See also 'man limits.conf')
>>>>
>>>> **
>>>>
>>>> Since "root can but Tena cannot",
>>>> I would check 2) first,
>>>> as they are the 'per user/per group' limits,
>>>> whereas 1) is kernel/system-wise.
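
The two kinds of limits above can also be read from inside a process, which makes it easy to compare the regular user against root with exactly the same script. This is a minimal sketch using Python's standard `resource` module; the helper names (`fmt`, `user_limits`, `system_file_table`) are made up for illustration, and the `/proc/sys/fs/file-nr` part assumes Linux.

```python
import resource


def fmt(value):
    """Render a limit value the way 'ulimit -a' would."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def user_limits():
    """Return {name: (soft, hard)} for a few per-user/per-process limits."""
    names = ["RLIMIT_NOFILE", "RLIMIT_NPROC", "RLIMIT_MEMLOCK"]
    return {name: resource.getrlimit(getattr(resource, name)) for name in names}


def system_file_table(path="/proc/sys/fs/file-nr"):
    """Return (allocated, unused, maximum) system-wide file handles."""
    with open(path) as handle:
        allocated, unused, maximum = (int(x) for x in handle.read().split())
    return allocated, unused, maximum


if __name__ == "__main__":
    # Per-user limits: the counterpart of 'ulimit -a' (item 2 above).
    for name, (soft, hard) in user_limits().items():
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
    # System-wide limit: the counterpart of fs.file-max (item 1 above).
    try:
        print("file-nr (allocated, unused, max):", system_file_table())
    except OSError:
        print("/proc/sys/fs/file-nr not available on this system")
```
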
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> PS - I know you are a wise and careful programmer,
>>>> but here we had cases of programs that would
>>>> fail because of too many files that were open and never closed,
>>>> eventually exceeding the max available/permissible.
>>>> So, it does happen.
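
As for telling how many file descriptors are in use or remaining: on Linux, a process can count its own open descriptors by listing `/proc/self/fd` and compare the count against its soft limit. The sketch below is illustrative only; `open_fd_count` and `headroom` are hypothetical helper names, and the demonstration leaks ten handles on purpose to show the count moving.

```python
import os
import resource


def open_fd_count(fd_dir="/proc/self/fd"):
    """Count descriptors open in this process (Linux only).

    Listing the directory opens one descriptor of its own, which shows
    up in the listing, so it is subtracted.
    """
    return len(os.listdir(fd_dir)) - 1


def headroom():
    """Descriptors still available under the soft limit, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - open_fd_count()


if __name__ == "__main__":
    print("open now:", open_fd_count(), "headroom:", headroom())
    # Simulate a program that opens files and never closes them.
    leaked = [open(os.devnull) for _ in range(10)]
    print("after opening 10 files:", open_fd_count())
    for handle in leaked:
        handle.close()
    print("after closing them:", open_fd_count())
```

For another process, the same listing works with `/proc/<pid>/fd`, given sufficient permissions.
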
>>>>
>>>>> I wish I could reproduce this (weird) behavior on a different
>>>>> set of machines. I certainly cannot in my local environment. Sigh!
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tena
>>>>>
>>>>>
>>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>>>>
>>>>>> It is concerning if the pipe system call fails - I can't think of why that
>>>>>> would happen. That's not usually a permissions issue but rather a deeper
>>>>>> indication that something is either seriously wrong on your system or you
>>>>>> are running out of file descriptors. Are file descriptors limited on a
>>>>>> per-process basis, perchance?
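
To illustrate the failure mode mentioned above (this is not a diagnosis of the EC2 problem): when a process has used up its per-process descriptor limit, `pipe()` fails with EMFILE ("Too many open files"). The sketch below lowers the soft limit on purpose and keeps creating pipes until the call fails; the function name and the limit of 32 are arbitrary choices for the demonstration.

```python
import errno
import os
import resource


def pipes_until_failure(soft_limit=32):
    """Lower the soft RLIMIT_NOFILE, then call pipe() until it fails.

    Returns (number_of_successful_pipes, errno_of_the_failure) and
    restores the original soft limit before returning.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    error = None
    try:
        while True:
            held.extend(os.pipe())  # each pipe consumes two descriptors
    except OSError as exc:
        error = exc.errno
    finally:
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return len(held) // 2, error


if __name__ == "__main__":
    count, err = pipes_until_failure()
    print(f"pipe() succeeded {count} times, then failed with "
          f"{errno.errorcode[err]}: {os.strerror(err)}")
```
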
>>>>>>
>>>>>> Sent from my PDA. No type good.
>>>>>>
>>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>>>>
>>>>>>> Hi Tena
>>>>>>>
>>>>>>> Since root can but you can't,
>>>>>>> is it a directory permission problem perhaps?
>>>>>>> Check the execution directory permission (on both machines,
>>>>>>> if this is not an NFS-mounted dir).
>>>>>>> I am not sure, but IIRC OpenMPI also uses /tmp for
>>>>>>> under-the-hood stuff, worth checking permissions there also.
>>>>>>> Just a naive guess.
>>>>>>>
>>>>>>> Congrats for all the progress with the cloudy MPI!
>>>>>>>
>>>>>>> Gus Correa
>>>>>>>
>>>>>>> Tena Sakai wrote:
>>>>>>>> Hi,
>>>>>>>> I have made a bit more progress. I think I can say the ssh
>>>>>>>> authentication problem is behind me now. I am still having a problem running
>>>>>>>> mpirun, but the latest discovery, which I can reproduce, is that
>>>>>>>> I can run mpirun as root. Here's the session log:
>>>>>>>> [tsakai_at_vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>>>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll
>>>>>>>> total 8
>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll .ssh
>>>>>>>> total 16
>>>>>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys
>>>>>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config
>>>>>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>>>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>>>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # I am on machine B
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ hostname
>>>>>>>> ip-10-100-243-195
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ ll
>>>>>>>> total 8
>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ cat app.ac
>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # go back to machine A
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ exit
>>>>>>>> logout
>>>>>>>> Connection to ip-10-100-243-195.ec2.internal closed.
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ hostname
>>>>>>>> ip-10-195-198-31
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # Execute mpirun
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to launch the specified application as it encountered an
>>>>>>>> error:
>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>> Node: ip-10-195-198-31
>>>>>>>> while attempting to start process rank 0.
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it as root
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ sudo su
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# pwd
>>>>>>>> /home/tsakai
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# ls -l /root/.ssh/config
>>>>>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# cat /root/.ssh/config
>>>>>>>> Host *
>>>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>>>> IdentitiesOnly yes
>>>>>>>> BatchMode yes
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# pwd
>>>>>>>> /home/tsakai
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# ls -l
>>>>>>>> total 8
>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# # now is the time for mpirun
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# mpirun --app ./app.ac
>>>>>>>> 13 ip-10-100-243-195
>>>>>>>> 21 ip-10-100-243-195
>>>>>>>> 5 ip-10-195-198-31
>>>>>>>> 8 ip-10-195-198-31
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# # It works (being root)!
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# exit
>>>>>>>> exit
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun --app app.ac
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to launch the specified application as it encountered an
>>>>>>>> error:
>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>> Node: ip-10-195-198-31
>>>>>>>> while attempting to start process rank 0.
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # I don't get it.
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ exit
>>>>>>>> logout
>>>>>>>> [tsakai_at_vixen ec2]$
>>>>>>>> So, why does it say "pipe function call failed when setting up
>>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ?
>>>>>>>> The node it is referring to is not the remote machine. It is
>>>>>>>> what I call machine A. I first thought maybe this is a problem
>>>>>>>> with the PATH variable. But I don't think so. I compared root's
>>>>>>>> path to that of tsakai's and made them identical and retried.
>>>>>>>> I got the same behavior.
>>>>>>>> If you could enlighten me why this is happening, I would really
>>>>>>>> appreciate it.
>>>>>>>> Thank you.
>>>>>>>> Tena
>>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>> Hi Jeff,
>>>>>>>>>
>>>>>>>>> Thanks for the firewall tip. I tried it while allowing all tcp traffic
>>>>>>>>> and got an interesting and perplexing result. Here's what's interesting
>>>>>>>>> (BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config on this run):
>>>>>>>>>
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>> Host key verification failed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>>>>>>>> to launch so we are aborting.
>>>>>>>>>
>>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>>
>>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>> that caused that situation.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to /usr/local/lib
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I'd better do this on machine B as well
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>>>>>>> Warning: Identity file tsakai not accessible: No such file or directory.
>>>>>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>>>>>>> LD_LIBRARY_PATH=/usr/local/lib
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ # OK, now go back to machine A
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ exit
>>>>>>>>> logout
>>>>>>>>> Connection to ip-10-195-171-159 closed.
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ hostname
>>>>>>>>> ip-10-203-21-132
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # try mpirun again
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>> Host key verification failed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>>>>>>>> to launch so we are aborting.
>>>>>>>>>
>>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>>
>>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>> that caused that situation.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I thought openmpi library was in /usr/local/lib...
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>>>>>>> total 16604
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> libfuse.so.2.8.5
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> libmpi.so.0.0.2
>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> libxml2.so.2.7.2
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
>>>>>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>
>>>>>>>>> Do you know why it's complaining about shared libraries?
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Tena
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like you
>>>>>>>>>> might have firewall issues.
>>>>>>>>>>
>>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back to
>>>>>>>>>> mpirun for various command and control reasons. If it is blocked from
>>>>>>>>>> doing so by a firewall, Open MPI won't run. In general, you can either
>>>>>>>>>> disable your firewall or you can set up a trust relationship for TCP
>>>>>>>>>> connections within your cluster.
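
Whether a firewall blocks TCP connections between two nodes can be checked independently of Open MPI. The following is a minimal sketch, not what orted does internally: the `can_connect` helper is a made-up name, and the self-test at the bottom only uses a local listener. For a real check, one would listen on a port on one node (for example with a small listener script) and call `can_connect` with that node's private address from the other node.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out (often a firewall silently dropping packets),
        # or the name could not be resolved.
        return False


if __name__ == "__main__":
    # Local self-test: listen on an ephemeral port, then connect to it.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    print("listener reachable:", can_connect("127.0.0.1", port))
    server.close()
    print("after closing listener:", can_connect("127.0.0.1", port))
```

Because mpirun listens on a port that changes from run to run (as in the tcp://...:54256 URI seen in the debug output), allowing all TCP traffic between the cluster nodes is usually simpler than opening individual ports.
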
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Reuti,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and the complete
>>>>>>>>>>> session is captured in the attached file.
>>>>>>>>>>>
>>>>>>>>>>> What I did is much like what I have done before: verify
>>>>>>>>>>> that ssh works and then run the mpirun command. In my somewhat lengthy
>>>>>>>>>>> session log, there are two responses from "LogLevel DEBUG3." First
>>>>>>>>>>> from an scp invocation and then from the mpirun invocation. They both
>>>>>>>>>>> say
>>>>>>>>>>> debug1: Authentication succeeded (publickey).
>>>>>>>>>>>
>>>>>>>>>>> From the mpirun invocation, I see a line:
>>>>>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>>>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca
>>>>>>>>>>> orte_ess_num_procs
>>>>>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>>>>>>> After that it hung and I control-C'd out of it, which
>>>>>>>>>>> gave me more lines. But the lines after
>>>>>>>>>>> debug1: Sending command: orted bla bla bla
>>>>>>>>>>> don't look good to me. But, in truth, I have no idea what they
>>>>>>>>>>> mean.
>>>>>>>>>>>
>>>>>>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Tena
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>>>>>>
>>>>>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>> No, my environment is entirely linux. The path to my home
>>>>>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>>>>>>> even though it is an nfs mount from vixen (which is known to
>>>>>>>>>>>>> itself as /home/tsakai). For historical reasons, I have
>>>>>>>>>>>>> chosen to give a symbolic link named /Users to vixen's /Home,
>>>>>>>>>>>>> so that I can use consistent path for both vixen and blitzen.
>>>>>>>>>>>> okay. Sometimes the protection of the home directory must be adjusted
>>>>>>>>>>>> too, but as you can do it from the command line this shouldn't be an
>>>>>>>>>>>> issue.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>>>>>>> but to address from one amazon cluster node to the other I
>>>>>>>>>>>>> use nodes' private dns names: domU-12-31-39-07-35-21 and
>>>>>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private dns names
>>>>>>>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>>>>>>>>>>>> amazon node A, from amazon node A to amazon node B, and from
>>>>>>>>>>>>> amazon node B back to A. (Please see my initial post. There
>>>>>>>>>>>>> is a session dialogue for this.) They all work without an
>>>>>>>>>>>>> authentication dialogue, except a brief initial dialogue:
>>>>>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>>>>>>> can't be established.
>>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)?
>>>>>>>>>>>>> to which I say "yes."
>>>>>>>>>>>>> But I am unclear about what you mean by "hostbased authentication"?
>>>>>>>>>>>>> Doesn't that mean with password? If so, it is not an option.
>>>>>>>>>>>> No. It's convenient inside a private cluster as it won't fill each
>>>>>>>>>>>> user's known_hosts file and you don't need to create any ssh-keys. But
>>>>>>>>>>>> when the hostname changes every time it might also create new hostkeys.
>>>>>>>>>>>> It uses hostkeys (private and public), this way it works for all users.
>>>>>>>>>>>> Just for reference:
>>>>>>>>>>>>
>>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>>>>>>
>>>>>>>>>>>> You could look into it later.
>>>>>>>>>>>>
>>>>>>>>>>>> ==
>>>>>>>>>>>>
>>>>>>>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>>>>>>>> `ssh domU-12-31-39-06-74-E2 ls`. Is this working too?
>>>>>>>>>>>>
>>>>>>>>>>>> - What about putting:
>>>>>>>>>>>>
>>>>>>>>>>>> LogLevel DEBUG3
>>>>>>>>>>>>
>>>>>>>>>>>> in your ~/.ssh/config. Maybe we can see what it's trying to negotiate
>>>>>>>>>>>> before it fails in verbose mode.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> your local machine is Linux like, but the execution hosts are Macs?
>>>>>>>>>>>>> I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>>
>>>>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks
>>>>>>>>>>>>> like:
>>>>>>>>>>>>> # machine A
>>>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>>>>>> This is just an abbreviation or nickname above. To use the
>>>>>>>>>>>>> specified settings, it's necessary to specify exactly this name.
>>>>>>>>>>>>> When the settings are the same anyway for all machines, you can use:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Host *
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It
>>>>>>>>>>>>> would also be an option to use hostbased authentication, which
>>>>>>>>>>>>> will avoid setting up any known_hosts file or passphraseless
>>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> HostName domU-12-31-39-07-35-21
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> # machine B
>>>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>>>> HostName domU-12-31-39-06-74-E2
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> This file exists on both machine A and machine B.
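
A minimal sketch of Reuti's catch-all "Host *" suggestion quoted above, assuming that mpirun starts its remote daemons through plain ssh (without the -i option), so the key has to come from ~/.ssh/config. The helper name write_ssh_config is made up for illustration; the key path is the one shown in the quoted config.

```shell
#!/bin/sh
# Sketch only: write a catch-all ssh client stanza so that a plain
# "ssh host" picks up the key without needing -i on the command line.
# write_ssh_config is a hypothetical helper, not an OpenSSH or Open MPI tool.
write_ssh_config() {
    key_path=$1      # private key, e.g. /home/tsakai/.ssh/tsakai
    config_file=$2   # normally ~/.ssh/config
    cat > "$config_file" <<EOF
Host *
    IdentityFile $key_path
    IdentitiesOnly yes
    BatchMode yes
EOF
    # ssh rejects a user config file that other users can write to
    chmod 600 "$config_file"
}

# Write to a scratch file first and inspect it before using it for real.
scratch=$(mktemp)
write_ssh_config /home/tsakai/.ssh/tsakai "$scratch"
cat "$scratch"
```

With BatchMode set to yes, ssh fails right away instead of prompting for a password, which should make a key problem easier to spot than a hanging mpirun.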
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>>>>
>>>>>>>>>>>>> It hangs. I control-C out of it and I get:
>>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report back when launched
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I making progress?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does this mean I am past authentication and something else is the
>>>>>>>>>>>>> problem?
>>>>>>>>>>>>> Does someone have an example .ssh/config file I can look at? There
>>>>>>>>>>>>> are so many keyword-argument pairs for this config file and I would
>>>>>>>>>>>>> like to look at some very basic one that works.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>>> tsakai_at_[hidden]
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ cat app.ac1
>>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> The program I run is
>>>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>>>> Where x is [5..8]. The machines vixen and blitzen each run 2 runs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ cat fib.R
>>>>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>>>>> # 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>>>
>>>>>>>>>>>>> fib <- function( n ) {
>>>>>>>>>>>>>     a <- 0
>>>>>>>>>>>>>     b <- 1
>>>>>>>>>>>>>     for ( i in 1:n ) {
>>>>>>>>>>>>>         t <- b
>>>>>>>>>>>>>         b <- a
>>>>>>>>>>>>>         a <- a + t
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     a
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> arg <- commandArgs( TRUE )
>>>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>>> cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>>>
>>>>>>>>>>>>> It reads an argument from command line and produces a fibonacci
>>>>>>>>>>>>> number that corresponds to that index, followed by the machine
>>>>>>>>>>>>> name. Pretty simple stuff.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's the run output:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>>> 5 vixen.egcrc.org
>>>>>>>>>>>>> 8 vixen.egcrc.org
>>>>>>>>>>>>> 13 blitzen.egcrc.org
>>>>>>>>>>>>> 21 blitzen.egcrc.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which is exactly what I expect. So far so good.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I want to run the same thing on cloud. I launch 2 instances of
>>>>>>>>>>>>> the same virtual machine, which I get to by:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without password authentication,
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai domU-12-31-39-0C-C8-01
>>>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A without using password
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' can't be established.
>>>>>>>>>>>>> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the list of known hosts.
>>>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>>> logout
>>>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>>> logout
>>>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see, neither machine uses password for authentication;
>>>>>>>>>>>>> it uses public/private key pairs. There is no problem (that I can
>>>>>>>>>>>>> see) for ssh invocation from one machine to the other. This is so
>>>>>>>>>>>>> because I have a copy of public key and a copy of private key on
>>>>>>>>>>>>> each instance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The app.ac file is identical, except the node names:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
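
Since the internal hostnames seem to change with each pair of instances, the appfile could be generated rather than edited by hand. A minimal sketch, assuming the same layout as the quoted app.ac1 (two runs per host, fib indices counting up from 5); make_appfile is a hypothetical helper, not part of Open MPI.

```shell
#!/bin/sh
# Sketch only: print one appfile line per run, two runs per host.
# make_appfile is a hypothetical helper; the line format is copied
# from the quoted app.ac1.
make_appfile() {
    script=$1
    shift                        # remaining arguments are host names
    arg=5                        # first fib index, as in app.ac1
    for host in "$@"; do
        for run in 1 2; do       # two runs per host
            echo "-H $host -np 1 Rscript $script $arg"
            arg=$((arg + 1))
        done
    done
}

make_appfile /home/tsakai/fib.R domU-12-31-39-00-D1-F2 domU-12-31-39-0C-C8-01
```

Redirecting the output to a file (for example app.ac1) gives something that can be passed to mpirun -app as in the quoted session.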
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>>> Permission denied, please try again.
>>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>>>
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mpirun (or somebody else?) asks me for a password, which I don't have.
>>>>>>>>>>>>> I end up typing control-C.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's my question:
>>>>>>>>>>>>> How can I get past authentication by mpirun where there is no
>>>>>>>>>>>>> password?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>>> tsakai_at_[hidden]