
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-02-12 09:06:47


Have you searched the email archive and/or the web for "openmpi and Amazon cloud"? Others have previously worked through many of these problems for that environment - it might be worth a look to see if someone has already solved this, or at least to find a contact point for someone who is already running in that environment.

IIRC, there are some unique problems with running on that platform.

On Feb 12, 2011, at 12:38 AM, Tena Sakai wrote:

> Hi Gus,
>
> Thank you for all your suggestions.
>
> I fixed the limits as you suggested and ran the test and
> I am still getting the same failure. More on that in a
> bit. But here is a bit of my response to what you mentioned.
>
>> the IP number you checked now is not the same as in your
>> message with the MPI failure/errors.
>> Not sure if I understand which computers we're talking about,
>> or where these computers are (at Amazon?),
>> or if they change depending on each session you use to run your programs,
>> if they are identical machines with the same limits or if they differ.
>
> Everything I mentioned in the last 2-3 days is on the Amazon EC2 cloud. I
> have no problem running the same thing locally (vixen is my local
> machine):
>
> [tsakai_at_vixen Rmpi]$ cat app.ac1
> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5
> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6
> -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 7
> -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 8
> [tsakai_at_vixen Rmpi]$
> [tsakai_at_vixen Rmpi]$ mpirun --app app.ac1
> 5 vixen.egcrc.org
> 8 vixen.egcrc.org
> 13 blitzen.egcrc.org
> 21 blitzen.egcrc.org
> [tsakai_at_vixen Rmpi]$ # these lines are correct result.
> [tsakai_at_vixen Rmpi]$
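The application-context file shown above (one `-H host -np N command` line per process) can be generated mechanically. A minimal POSIX-shell sketch; the hostnames and fib.R path mirror the transcript and are placeholders for whatever the current nodes actually are:

```shell
# Sketch: build an Open MPI application-context file like app.ac1 above.
# Hostnames and the fib.R path are placeholders.
: > app.ac.demo
for spec in 'vixen 5' 'vixen 6' 'blitzen 7' 'blitzen 8'; do
  set -- $spec                       # split "host arg" into $1 and $2
  echo "-H $1 -np 1 Rscript /home/tsakai/fib.R $2" >> app.ac.demo
done
cat app.ac.demo
```

Each line becomes one process; `mpirun --app app.ac.demo` would launch them, assuming the hosts resolve and fib.R exists on each node.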
>
> Amazon EC2, where the strange behavior happens, is a virtualized
> environment. They charge by the hour. I launch an instance of a machine
> when I need it and shut it down when I am done. Each time I get
> different IP addresses (2 per instance: one on the internal network and
> the other for the public interface). That is why I don't show consistent
> IP addresses or DNS names. Every time I shut down an instance, what I did
> on it disappears, and on the next instance I have to recreate everything
> from scratch -- case in point: ~/.ssh/config -- which is what
> I have been doing (unless I take a 'snapshot' of the image and save it
> to persistent storage, and taking a snapshot is a bit of work).
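One way around the vanishing per-instance state is to recreate ~/.ssh/config from a boot-time script. A sketch, using the options shown for root's config later in this thread; SSH_DIR defaults to a scratch directory here (point it at "$HOME/.ssh" on a real instance), and the key filename is a placeholder:

```shell
# Recreate the ssh client config that a fresh EC2 instance is missing.
# SSH_DIR defaults to a scratch directory so this can be rehearsed safely;
# the IdentityFile name is a placeholder.
SSH_DIR=${SSH_DIR:-./ssh-demo}
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"
cat > "$SSH_DIR/config" <<'EOF'
Host *
    IdentityFile ~/.ssh/tsakai
    IdentitiesOnly yes
    BatchMode yes
EOF
chmod 600 "$SSH_DIR/config"
cat "$SSH_DIR/config"
```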
>
>> One of the error messages mentions LD_LIBRARY_PATH.
>> Is it set to point to the OpenMPI lib directory?
>> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly
>> set.
>
> Yes, I have been setting LD_LIBRARY_PATH manually every time, because
> I neglected to put it into my bash startup file when building the AMI
> (Amazon Machine Image).
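The missing startup-file step can be sketched as follows. RC defaults to a demo file so the append can be rehearsed; on a real instance it would be ~/.bashrc (or baked into the AMI), and the /usr/local paths assume the install location used in this thread:

```shell
# Persist the Open MPI environment across instance rebuilds by appending
# it to the bash startup file. RC is a demo path; use ~/.bashrc for real.
RC=${RC:-./bashrc-demo}
cat >> "$RC" <<'EOF'
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
EOF
grep LD_LIBRARY_PATH "$RC"
```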
>
> Now what I have done is: get onto an instance as tsakai, save the output
> from 'ulimit -a', set the /etc/security/limits.conf parameters as you
> suggest, log off and back onto the instance (thereby activating
> those ulimit parameters), and run the same (actually simpler) test,
> as tsakai and as root.
>
> [tsakai_at_vixen Rmpi]$
> [tsakai_at_vixen Rmpi]$ # 2ec2 below is a script/wrapper around ssh to
> [tsakai_at_vixen Rmpi]$ # make ssh invocation line shorter.
> [tsakai_at_vixen Rmpi]$
> [tsakai_at_vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com
> The authenticity of host 'ec2-50-16-55-64.compute-1.amazonaws.com
> (50.16.55.64)' can't be established.
> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
> Are you sure you want to continue connecting (yes/no)? yes
> Last login: Tue Feb 8 22:52:54 2011 from 10.201.197.188
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ ulimit -a > mylimit.1
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ sudo su
> bash-3.2#
> bash-3.2# cat - >> /etc/security/limits.conf
> * - memlock -1
> * - stack -1
> * - nofile 4096
> bash-3.2#
> bash-3.2# tail /etc/security/limits.conf
> #@student hard nproc 20
> #@faculty soft nproc 20
> #@faculty hard nproc 50
> #ftp hard nproc 0
> #@student - maxlogins 4
>
> # End of file
> * - memlock -1
> * - stack -1
> * - nofile 4096
> bash-3.2#
> bash-3.2# exit
> exit
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ # logout and log back in to activate the
> [tsakai_at_ip-10-114-138-129 ~]$ # new setting.
> [tsakai_at_ip-10-114-138-129 ~]$ exit
> logout
> [tsakai_at_vixen ec2]$
> [tsakai_at_vixen ec2]$ # I am back on vixen and about to log back onto
> [tsakai_at_vixen ec2]$ # the instance which is still running.
> [tsakai_at_vixen ec2]$
> [tsakai_at_vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com
> Last login: Fri Feb 11 23:50:47 2011 from 63.193.205.1
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ ulimit -a > mylimit.2
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ diff mylimit.1 mylimit.2
> 6c6
> < max locked memory (kbytes, -l) 32
> ---
>> max locked memory (kbytes, -l) unlimited
> 8c8
> < open files (-n) 1024
> ---
>> open files (-n) 4096
> 12c12
> < stack size (kbytes, -s) 8192
> ---
>> stack size (kbytes, -s) unlimited
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ # yes, I have the same ulimit parameters as
> [tsakai_at_ip-10-114-138-129 ~]$ # Gus suggested.
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ export LD_LIBRARY_PATH=/usr/local/lib
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ env | grep LD_LIB
> LD_LIBRARY_PATH=/usr/local/lib
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ cat - > app.ac
> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ cat app.ac
> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
> -H ip-10-114-138-129.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ hostname
> ip-10-114-138-129
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ # this run doesn't involve other node.
> [tsakai_at_ip-10-114-138-129 ~]$ # just use this machine's cores.
> [tsakai_at_ip-10-114-138-129 ~]$ # there are 2 cores on this machine.
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ mpirun --app app.ac
> --------------------------------------------------------------------------
> mpirun was unable to launch the specified application as it encountered an
> error:
>
> Error: pipe function call failed when setting up I/O forwarding subsystem
> Node: ip-10-114-138-129
>
> while attempting to start process rank 0.
> --------------------------------------------------------------------------
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ # I still get the same error!
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ cat /proc/sys/fs/file-nr
> 512 0 762674
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ # number of open files (512) is nowhere
> [tsakai_at_ip-10-114-138-129 ~]$ # close to the limit, which is 4096 now.
> [tsakai_at_ip-10-114-138-129 ~]$ # now let's run it as root.
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ sudo su
> bash-3.2#
> bash-3.2# env | grep LD_LIBR
> LD_LIBRARY_PATH=/usr/local/lib
> bash-3.2#
> bash-3.2# pwd
> /home/tsakai
> bash-3.2#
> bash-3.2# mpirun --app ./app.ac
> 5 ip-10-114-138-129
> 8 ip-10-114-138-129
> bash-3.2#
> bash-3.2# # that's correct result!
> bash-3.2#
> bash-3.2# cat /proc/sys/fs/file-nr
> 512 0 762674
> bash-3.2#
> bash-3.2# # this shows that mpirun didn't leave any
> bash-3.2# # open files behind, I think. That's good.
> bash-3.2#
> bash-3.2# exit
> exit
> [tsakai_at_ip-10-114-138-129 ~]$
> [tsakai_at_ip-10-114-138-129 ~]$ exit
> logout
> [tsakai_at_vixen ec2]$
>
> Had it failed both as root and as user tsakai, I could
> conclude that either the virtualized environment disagrees
> with Open MPI OR there is something wrong with what I am
> trying to do. But what kills me is that it *does* work when
> run by root. Why the pipe system call fails for user tsakai
> and not for root is something I don't understand.
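A rough shell-level probe of whether a user can still create pipes; this only exercises pipe(2) (via `|`) and named-pipe creation in the shell's own environment, and does not reproduce whatever mpirun's I/O-forwarding setup does internally:

```shell
# Anonymous pipe: the "|" forces a pipe(2) call in the shell.
echo probe | cat && echo "anonymous pipe: ok"
# Named pipe: mkfifo exercises FIFO creation in the filesystem.
mkfifo ./probe.fifo && echo "named pipe: ok"
rm -f ./probe.fifo
```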
>
> BTW, here is the same test (using a single machine) in my local
> environment (i.e., not a virtualized environment):
>
> [tsakai_at_vixen Rmpi]$ cat app.ac2
> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5
> -H vixen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6
> [tsakai_at_vixen Rmpi]$
> [tsakai_at_vixen Rmpi]$ mpirun --app app.ac2
> 5 vixen.egcrc.org
> 8 vixen.egcrc.org
> [tsakai_at_vixen Rmpi]$
>
> I am running out of stones to turn over for now, and maybe it's
> a good time to go to bed. :)
>
> I would appreciate it if you can come up with different things
> to try.
>
> Many thanks for your help.
>
> Regards,
>
> Tena
>
>
> On 2/11/11 7:45 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>
>> Hi Tena
>>
>> We setup the cluster nodes to run MPI programs
>> with stacksize unlimited,
>> memlock unlimited,
>> 4096 max open files,
>> to avoid crashing on edge cases.
>> This is kind of typical for HPC, MPI, number crunching.
>>
>> However, some are quite big codes,
>> and from what you said yours is not (or not yet).
>>
>> Your stack limit sounds quite small, but when
>> we had problems with stack the result was a segmentation fault.
>> 1024 files I guess is a default for 32 bit Linux distributions,
>> but some programs break there.
>>
>> If you want to do this, put these lines on the bottom
>> of /etc/security/limits.conf:
>>
>> # End of file
>> * - memlock -1
>> * - stack -1
>> * - nofile 4096
>>
>> I don't think you should give unlimited number of processes to
>> regular users; keep this privilege to root (which is where
>> the two have different limits).
>>
>> You may want to monitor /proc/sys/fs/file-nr while the program runs.
>> The first number is the actual number of open files.
>> Top or vmstat also help see how you are doing in terms of memory,
>> although you suggested these are (small?) test programs, unlikely to run
>> out of memory.
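Gus's file-nr check as a one-shot read (Linux-only, since it depends on /proc):

```shell
# Field 1 of /proc/sys/fs/file-nr is the number of allocated file
# handles; field 3 is the system-wide maximum (fs.file-max).
if [ -r /proc/sys/fs/file-nr ]; then
  awk '{printf "allocated=%s max=%s\n", $1, $3}' /proc/sys/fs/file-nr
else
  echo "no /proc/sys/fs/file-nr on this system"
fi
```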
>>
>> If you are using two nodes, check the same stuff on the other node too.
>> Also, the IP number you checked now is not the same as in your
>> message with the MPI failure/errors.
>> Not sure if I understand which computers we're talking about,
>> or where these computers are (at Amazon?),
>> or if they change depending on each session you use to run your programs,
>> if they are identical machines with the same limits or if they differ.
>>
>> One of the error messages mentions LD_LIBRARY_PATH.
>> Is it set to point to the OpenMPI lib directory?
>> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly set.
>>
>> I hope this helps, although I am afraid I may be missing the point.
>>
>> Gus Correa
>>
>> Tena Sakai wrote:
>>> Hi Gus,
>>>
>>> Thank you for your tips.
>>>
>>> I didn't find any smoking gun or anything that comes close.
>>> Here's the upshot:
>>>
>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 61504
>>> max locked memory (kbytes, -l) 32
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 1024
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) 8192
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) 61504
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>> bash-3.2#
>>> bash-3.2# ulimit -a
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 61504
>>> max locked memory (kbytes, -l) 32
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 1024
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) 8192
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) unlimited
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>> bash-3.2#
>>> bash-3.2#
>>> bash-3.2# ulimit -a > root_ulimit-a
>>> bash-3.2# exit
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>>> 14c14
>>> < max user processes (-u) unlimited
>>> ---
>>>> max user processes (-u) 61504
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>> 480 0 762674
>>> 762674
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>> bash-3.2#
>>> bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>> 512 0 762674
>>> 762674
>>> bash-3.2# exit
>>> exit
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>>> -bash: sysctl: command not found
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ /sbin/!!
>>> /sbin/sysctl -a |grep fs.file-max
>>> error: permission denied on key 'kernel.cad_pid'
>>> error: permission denied on key 'kernel.cap-bound'
>>> fs.file-max = 762674
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
>>> fs.file-max = 762674
>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>
>>> I see a bit of difference between root and tsakai, but I cannot
>>> believe such a small difference results in the somewhat catastrophic
>>> failure I have reported. Would you agree with me?
>>>
>>> Regards,
>>>
>>> Tena
>>>
>>> On 2/11/11 6:06 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>
>>>> Hi Tena
>>>>
>>>> Please read one answer inline.
>>>>
>>>> Tena Sakai wrote:
>>>>> Hi Jeff,
>>>>> Hi Gus,
>>>>>
>>>>> Thanks for your replies.
>>>>>
>>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>>>>> as identical to that of root. In that setting I reproduced the
>>>>> same result as before: root can run mpirun correctly and tsakai
>>>>> cannot.
>>>>>
>>>>> I have also checked out permission on /tmp directory. tsakai has
>>>>> no problem creating files under /tmp.
>>>>>
>>>>> I am trying to come up with a strategy to show that each and every
>>>>> program in the PATH has "world" execute permission. It is a
>>>>> stone to turn over, but I am not holding my breath.
>>>>>
>>>>>> ... you are running out of file descriptors. Are file descriptors
>>>>>> limited on a per-process basis, perchance?
>>>>> I have never heard of such a restriction on Amazon EC2. There
>>>>> are folks who keep instances running for a long, long time. Whereas
>>>>> in my case, I launch 2 instances, check things out, and then turn
>>>>> the instances off. (Given that the state of California has huge
>>>>> debts, our funding is very tight.) So, I really doubt that's the
>>>>> case. I have run mpirun unsuccessfully as user tsakai and immediately
>>>>> after successfully as root. Still, I would be happy if you could tell
>>>>> me a way to tell the number of file descriptors used or remaining.
>>>>>
>>>>> Your mention of file descriptors made me think of something under
>>>>> /dev. But I don't know exactly what I am fishing for. Do you have
>>>>> any suggestions?
>>>>>
>>>> 1) If the environment has anything to do with Linux,
>>>> check:
>>>>
>>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>>
>>>>
>>>> or
>>>>
>>>> sysctl -a |grep fs.file-max
>>>>
>>>> This max can be set (fs.file-max=whatever_is_reasonable)
>>>> in /etc/sysctl.conf
>>>>
>>>> See 'man sysctl' and 'man sysctl.conf'
>>>>
>>>> 2) Another possible source of limits.
>>>>
>>>> Check "ulimit -a" (bash) or "limit" (tcsh).
>>>>
>>>> If you need to change look at:
>>>>
>>>> /etc/security/limits.conf
>>>>
>>>> (See also 'man limits.conf')
>>>>
>>>> **
>>>>
>>>> Since "root can but Tena cannot",
>>>> I would check 2) first,
>>>> as they are the 'per user/per group' limits,
>>>> whereas 1) is kernel/system-wise.
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> PS - I know you are a wise and careful programmer,
>>>> but here we had cases of programs that would
>>>> fail because of too many files that were open and never closed,
>>>> eventually exceeding the max available/permissible.
>>>> So, it does happen.
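For the per-process side of this, one can count a process's open descriptors under /proc/&lt;pid&gt;/fd and compare against `ulimit -n`; a Linux-only sketch, inspecting the current shell itself:

```shell
# Compare this shell's open descriptors with its per-process limit.
limit=$(ulimit -n)
if [ -d "/proc/$$/fd" ]; then
  nopen=$(ls "/proc/$$/fd" | wc -l)
  echo "open fds: $nopen (limit: $limit)"
else
  echo "per-process fd listing needs Linux /proc (limit: $limit)"
fi
```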
>>>>
>>>>> I wish I could reproduce this (weird) behavior on a different
>>>>> set of machines. I certainly cannot in my local environment. Sigh!
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tena
>>>>>
>>>>>
>>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>>>>
>>>>>> It is concerning if the pipe system call fails - I can't think of why
>>>>>> that would happen. That's not usually a permissions issue but rather a
>>>>>> deeper indication that something is either seriously wrong on your system
>>>>>> or you are running out of file descriptors. Are file descriptors limited
>>>>>> on a per-process basis, perchance?
>>>>>>
>>>>>> Sent from my PDA. No type good.
>>>>>>
>>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>>>>
>>>>>>> Hi Tena
>>>>>>>
>>>>>>> Since root can but you can't,
>>>>>>> is it a directory permission problem perhaps?
>>>>>>> Check the execution directory permission (on both machines,
>>>>>>> if this is not NFS mounted dir).
>>>>>>> I am not sure, but IIRR OpenMPI also uses /tmp for
>>>>>>> under-the-hood stuff, worth checking permissions there also.
>>>>>>> Just a naive guess.
>>>>>>>
>>>>>>> Congrats for all the progress with the cloudy MPI!
>>>>>>>
>>>>>>> Gus Correa
>>>>>>>
>>>>>>> Tena Sakai wrote:
>>>>>>>> Hi,
>>>>>>>> I have made a bit more progress. I think I can say the ssh
>>>>>>>> authentication problem is behind me now. I am still having a problem running
>>>>>>>> mpirun, but the latest discovery, which I can reproduce, is that
>>>>>>>> I can run mpirun as root. Here's the session log:
>>>>>>>> [tsakai_at_vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>>>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll
>>>>>>>> total 8
>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll .ssh
>>>>>>>> total 16
>>>>>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys
>>>>>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config
>>>>>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>>>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>>>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # I am on machine B
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ hostname
>>>>>>>> ip-10-100-243-195
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ ll
>>>>>>>> total 8
>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ cat app.ac
>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # go back to machine A
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ exit
>>>>>>>> logout
>>>>>>>> Connection to ip-10-100-243-195.ec2.internal closed.
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ hostname
>>>>>>>> ip-10-195-198-31
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # Execute mpirun
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to launch the specified application as it encountered an
>>>>>>>> error:
>>>>>>>>
>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>> Node: ip-10-195-198-31
>>>>>>>>
>>>>>>>> while attempting to start process rank 0.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it as root
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ sudo su
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# pwd
>>>>>>>> /home/tsakai
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# ls -l /root/.ssh/config
>>>>>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# cat /root/.ssh/config
>>>>>>>> Host *
>>>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>>>> IdentitiesOnly yes
>>>>>>>> BatchMode yes
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# pwd
>>>>>>>> /home/tsakai
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# ls -l
>>>>>>>> total 8
>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# # now is the time for mpirun
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# mpirun --app ./app.ac
>>>>>>>> 13 ip-10-100-243-195
>>>>>>>> 21 ip-10-100-243-195
>>>>>>>> 5 ip-10-195-198-31
>>>>>>>> 8 ip-10-195-198-31
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# # It works (being root)!
>>>>>>>> bash-3.2#
>>>>>>>> bash-3.2# exit
>>>>>>>> exit
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun --app app.ac
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to launch the specified application as it encountered an
>>>>>>>> error:
>>>>>>>>
>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>> Node: ip-10-195-198-31
>>>>>>>>
>>>>>>>> while attempting to start process rank 0.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # I don't get it.
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ exit
>>>>>>>> logout
>>>>>>>> [tsakai_at_vixen ec2]$
>>>>>>>> So, why does it say "pipe function call failed when setting up
>>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31"?
>>>>>>>> The node it is referring to is not the remote machine; it is
>>>>>>>> what I call machine A. I first thought maybe this was a problem
>>>>>>>> with the PATH variable, but I don't think so. I compared root's
>>>>>>>> PATH to that of tsakai's, made them identical, and retried.
>>>>>>>> I got the same behavior.
>>>>>>>> If you could enlighten me as to why this is happening, I would
>>>>>>>> really appreciate it.
>>>>>>>> Thank you.
>>>>>>>> Tena
>>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>> Hi jeff,
>>>>>>>>>
>>>>>>>>> Thanks for the firewall tip. I tried it while allowing all tcp traffic
>>>>>>>>> and got an interesting and perplexing result. Here's what's interesting
>>>>>>>>> (BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config on this run):
>>>>>>>>>
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>> Host key verification failed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>>>>>>>> to launch so we are aborting.
>>>>>>>>>
>>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>>
>>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>>>>> the location of the shared libraries on the remote nodes and this will
>>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>> that caused that situation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to
>>>>>>>>> /usr/local/lib
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I'd better do this on machine B as well
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>>>>>>> Warning: Identity file tsakai not accessible: No such file or
>>>>>>>>> directory.
>>>>>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>>>>>>> LD_LIBRARY_PATH=/usr/local/lib
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ # OK, now go back to machine A
>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ exit
>>>>>>>>> logout
>>>>>>>>> Connection to ip-10-195-171-159 closed.
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ hostname
>>>>>>>>> ip-10-203-21-132
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # try mpirun again
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>> Host key verification failed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>>>>>>>> to launch so we are aborting.
>>>>>>>>>
>>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>>
>>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>>>>> the location of the shared libraries on the remote nodes and this will
>>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>> that caused that situation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I thought openmpi library was in
>>>>>>>>> /usr/local/lib...
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>>>>>>> total 16604
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> libfuse.so.2.8.5
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> libmpi.so.0.0.2
>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> libxml2.so.2.7.2
>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
>>>>>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>
>>>>>>>>> Do you know why it's complaining about shared libraries?
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Tena
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like
>>>>>>>>>> you might have firewall issues.
>>>>>>>>>>
>>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back to
>>>>>>>>>> mpirun for various command and control reasons. If it is blocked
>>>>>>>>>> from doing so by a firewall, Open MPI won't run. In general, you
>>>>>>>>>> can either disable your firewall or you can set up a trust
>>>>>>>>>> relationship for TCP connections within your cluster.
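
On EC2, "firewall" usually means the security group in addition to any in-instance iptables rules. A hedged sketch of opening intra-cluster TCP, assuming a hypothetical security group named my-mpi-group and the EC2 API tools of that era (the exact ec2-authorize flags, e.g. whether -o also needs the owner account via -u, should be checked against its documentation):

```shell
# allow all TCP between members of the same (hypothetical) security group
ec2-authorize my-mpi-group -P tcp -p 0-65535 -o my-mpi-group
# and/or, inside each instance, accept TCP from the EC2-internal 10/8 range
iptables -I INPUT -p tcp -s 10.0.0.0/8 -j ACCEPT
```

Because orted connects back to mpirun on a dynamically chosen port, opening a single well-known port is not enough; the whole intra-cluster range has to be reachable.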
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Reuti,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and the
>>>>>>>>>>> complete session is captured in the attached file.
>>>>>>>>>>>
>>>>>>>>>>> What I did is very similar to what I have done before: verify
>>>>>>>>>>> that ssh works and then run the mpirun command. In my somewhat
>>>>>>>>>>> lengthy session log, there are two responses from "LogLevel
>>>>>>>>>>> DEBUG3." First from an scp invocation and then from the mpirun
>>>>>>>>>>> invocation. They both say
>>>>>>>>>>> debug1: Authentication succeeded (publickey).
>>>>>>>>>>>
>>>>>>>>>>> From mpirun invocation, I see a line:
>>>>>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>>>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca
>>>>>>>>>>> orte_ess_num_procs
>>>>>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>>>>>>> After that it hung and I control-C'd out of it, which gave me
>>>>>>>>>>> more lines. But the lines after
>>>>>>>>>>> debug1: Sending command: orted bla bla bla
>>>>>>>>>>> don't look good to me. But, in truth, I have no idea what they
>>>>>>>>>>> mean.
>>>>>>>>>>>
>>>>>>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Tena
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>>>>>>
>>>>>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>> No, my environment is entirely Linux. The path to my home
>>>>>>>>>>>>> directory on one host (blitzen) is known as /Users/tsakai,
>>>>>>>>>>>>> even though it is an NFS mount from vixen (which is known to
>>>>>>>>>>>>> itself as /home/tsakai). For historical reasons, I have
>>>>>>>>>>>>> chosen to make a symbolic link named /Users to vixen's /home,
>>>>>>>>>>>>> so that I can use a consistent path on both vixen and blitzen.
>>>>>>>>>>>> okay. Sometimes the protection of the home directory must be
>>>>>>>>>>>> adjusted too, but as you can do it from the command line this
>>>>>>>>>>>> shouldn't be an issue.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>>>>>>> but to address from one Amazon cluster node to the other I
>>>>>>>>>>>>> use the nodes' private DNS names: domU-12-31-39-07-35-21 and
>>>>>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private DNS names
>>>>>>>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>>>>>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>>>>>>> Amazon node B back to A. (Please see my initial post. There
>>>>>>>>>>>>> is a session dialogue for this.) They all work without an
>>>>>>>>>>>>> authentication dialogue, except a brief initial dialogue:
>>>>>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>>>>>>> can't be established.
>>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)?
>>>>>>>>>>>>> to which I say "yes."
>>>>>>>>>>>>> But I am unclear on what you mean by "hostbased authentication."
>>>>>>>>>>>>> Doesn't that mean with a password? If so, it is not an option.
>>>>>>>>>>>> No. It's convenient inside a private cluster as it won't fill
>>>>>>>>>>>> each user's known_hosts file and you don't need to create any
>>>>>>>>>>>> ssh-keys. But when the hostname changes every time it might also
>>>>>>>>>>>> create new hostkeys. It uses hostkeys (private and public), so
>>>>>>>>>>>> it works for all users. Just for reference:
>>>>>>>>>>>>
>>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>>>>>>
>>>>>>>>>>>> You could look into it later.
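
Related to the changing hostnames: with BatchMode yes, the "authenticity of host" prompt quoted earlier makes ssh fail instead of asking, which can stall mpirun on freshly launched instances. A minimal ~/.ssh/config sketch that relaxes host-key checking for the EC2-internal name patterns only (a deliberate security trade-off for throwaway VMs; the key path is the one used elsewhere in this thread):

```
Host domU-* ip-10-*
    IdentityFile /home/tsakai/.ssh/tsakai
    IdentitiesOnly yes
    BatchMode yes
    # accept unknown host keys silently and don't record them anywhere
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```

This trades man-in-the-middle protection for not having to answer the prompt on every fresh launch, which is usually acceptable only on a private cloud subnet.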
>>>>>>>>>>>>
>>>>>>>>>>>> ==
>>>>>>>>>>>>
>>>>>>>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>>>>>>>> `ssh domU-12-31-39-06-74-E2 ls`. Is this working too?
>>>>>>>>>>>>
>>>>>>>>>>>> - What about putting:
>>>>>>>>>>>>
>>>>>>>>>>>> LogLevel DEBUG3
>>>>>>>>>>>>
>>>>>>>>>>>> in your ~/.ssh/config? Maybe we can see what it's trying to
>>>>>>>>>>>> negotiate before it fails in verbose mode.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> your local machine is Linux-like, but the execution hosts are
>>>>>>>>>>>>> Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>>
>>>>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks
>>>>>>>>>>>>> like:
>>>>>>>>>>>>> # machine A
>>>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>>>>>> This is just an abbreviation or nickname above. To use the
>>>>>>>>>>>>> specified settings, it's necessary to specify exactly this
>>>>>>>>>>>>> name. When the settings are the same anyway for all machines,
>>>>>>>>>>>>> you can use:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Host *
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It
>>>>>>>>>>>>> would also be an option to use hostbased authentication, which
>>>>>>>>>>>>> will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> HostName domU-12-31-39-07-35-21
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> # machine B
>>>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>>>> HostName domU-12-31-39-06-74-E2
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>>>>
>>>>>>>>>>>>> it hangs. I control-C out of it and I get:
>>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>>> process
>>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes
>>>>>>>>>>>>> shown
>>>>>>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not
>>>>>>>>>>>>> report back when launched
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I making progress?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does this mean I am past authentication and something else is
>>>>>>>>>>>>> the problem?
>>>>>>>>>>>>> Does someone have an example .ssh/config file I can look at?
>>>>>>>>>>>>> There are so many keyword-argument pairs for this config file
>>>>>>>>>>>>> and I would like to look at some very basic one that works.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>>> tsakai_at_[hidden]
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ cat app.ac1
>>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript
>>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript
>>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> The program I run is
>>>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>>>> where x is 5..8. The machines vixen and blitzen each get 2 runs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s the program fib.R:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ cat fib.R
>>>>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>>>
>>>>>>>>>>>>> fib <- function( n ) {
>>>>>>>>>>>>> a <- 0
>>>>>>>>>>>>> b <- 1
>>>>>>>>>>>>> for ( i in 1:n ) {
>>>>>>>>>>>>> t <- b
>>>>>>>>>>>>> b <- a
>>>>>>>>>>>>> a <- a + t
>>>>>>>>>>>>> }
>>>>>>>>>>>>> a
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> arg <- commandArgs( TRUE )
>>>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>>> cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>>>
>>>>>>>>>>>>> It reads an argument from the command line and produces the
>>>>>>>>>>>>> Fibonacci number that corresponds to that index, followed by
>>>>>>>>>>>>> the machine name. Pretty simple stuff.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s the run output:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>>> 5 vixen.egcrc.org
>>>>>>>>>>>>> 8 vixen.egcrc.org
>>>>>>>>>>>>> 13 blitzen.egcrc.org
>>>>>>>>>>>>> 21 blitzen.egcrc.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which is exactly what I expect. So far so good.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I want to run the same thing on the cloud. I launch 2
>>>>>>>>>>>>> instances of the same virtual machine, which I get to by:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ ssh -A -i ~/.ssh/tsakai
>>>>>>>>>>>>> machine-instance-A-public-dns
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B
>>>>>>>>>>>>> without
>>>>>>>>>>>>> password authentication,
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine
>>>>>>>>>>>>> A
>>>>>>>>>>>>> without using password
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)'
>>>>>>>>>>>>> can't
>>>>>>>>>>>>> be established.
>>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the
>>>>>>>>>>>>> list
>>>>>>>>>>>>> of
>>>>>>>>>>>>> known hosts.
>>>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>>> logout
>>>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>>> logout
>>>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see, neither machine uses a password for
>>>>>>>>>>>>> authentication; they use public/private key pairs. There is no
>>>>>>>>>>>>> problem (that I can see) with ssh invocation from one machine
>>>>>>>>>>>>> to the other. This is so because I have a copy of the public
>>>>>>>>>>>>> key and a copy of the private key on each instance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The app.ac file is identical, except the node names:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
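
Before bringing mpirun into the picture, it can help to confirm that every host named in the appfile accepts non-interactive, passwordless ssh, since mpirun launches orted over exactly that kind of connection. A sketch (the appfile contents are copied from above; the ssh loop is shown commented out because those hostnames are only valid inside that particular EC2 launch):

```shell
# recreate the appfile from above (EC2-internal hostnames, valid per launch)
cat > app.ac1 <<'EOF'
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
-H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
-H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
EOF

# list each distinct host that mpirun will need to reach
awk '$1 == "-H" { print $2 }' app.ac1 | sort -u

# for each host, a passwordless login check; BatchMode makes ssh fail fast
# instead of prompting (uncomment and run on the live nodes):
# for h in $(awk '$1 == "-H" { print $2 }' app.ac1 | sort -u); do
#   ssh -o BatchMode=yes -i ~/.ssh/tsakai "$h" true && echo "$h ok" || echo "$h FAILED"
# done
```

If any host reports FAILED here (or prompts for a password), mpirun will hit the same wall when it tries to start orted there.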
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s what happens with mpirun:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>>> Permission denied, please try again.
>>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>>> process
>>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>>>
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun (or somebody else?) asks me for a password, which I
>>>>>>>>>>>>> don't have. I end up typing control-C.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s my question:
>>>>>>>>>>>>> How can I get past authentication by mpirun where there is no
>>>>>>>>>>>>> password?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>>> tsakai_at_[hidden]
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users