
Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Gus Correa (gus_at_[hidden])
Date: 2011-02-14 14:15:09


Tena Sakai wrote:
> Hi Reuti,
>
>> a) can you ssh from dasher to vixen?
> Yes, no problem.
> [tsakai_at_dasher Rmpi]$
> [tsakai_at_dasher Rmpi]$ hostname
> dasher.egcrc.org
> [tsakai_at_dasher Rmpi]$
> [tsakai_at_dasher Rmpi]$ ssh vixen
> Last login: Mon Feb 14 10:39:20 2011 from dasher.egcrc.org
> [tsakai_at_vixen ~]$
> [tsakai_at_vixen ~]$ hostname
> vixen.egcrc.org
> [tsakai_at_vixen ~]$
>
>> b) firewall on vixen?
> There is no firewall on vixen that I know of, but I don't
> know how I can definitively show it one way or the other.
> Can you please suggest how I can do this?
>
> Regards,
>
> Tena
>
>

Hi Tena

Besides Reuti's suggestions:

Check the consistency of /etc/hosts on both machines.
Check whether there are restrictions in /etc/hosts.allow and
/etc/hosts.deny on both machines.
Check whether both the MPI directories and your home/work directory
are mounted/available on both machines.
(We may have been through this checklist before, sorry if I forgot.)
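
A minimal sketch of those checks from the shell (the Rmpi path is just
the work directory from your session log):

md5sum /etc/hosts                     # run on both vixen and dasher; the sums should match
cat /etc/hosts.allow /etc/hosts.deny  # look for sshd or ALL restrictions
df -h ~/Notes/R/parallel/Rmpi         # confirm the NFS work directory is mounted on dasher
which mpirun orted                    # Open MPI should resolve to the same install on both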

Firewall info (not very friendly syntax ...):

iptables --list

or maybe better:

cat /etc/sysconfig/iptables
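
If iptables turns out to be active, one blunt test (a hedged sketch,
assuming RHEL/CentOS-style init scripts) is to stop it briefly on both
machines, retry mpirun from dasher, then turn it back on:

sudo /sbin/iptables -L -n          # numeric listing of the current rules
sudo /sbin/service iptables stop   # temporarily disable the firewall for the test
# rerun: mpirun -app app.ac3
sudo /sbin/service iptables start  # re-enable it afterwards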

I hope it helps,
Gus Correa

> On 2/14/11 4:38 AM, "Reuti" <reuti_at_[hidden]> wrote:
>
>> Hi,
>>
>> Am 14.02.2011 um 04:54 schrieb Tena Sakai:
>>
>>> I have digressed and started downward descent...
>>>
>>> I was trying to make a simple and clear case. Everything
>>> I write in this very mail is about local machines. There
>>> are no virtual machines involved. I am talking about two
>>> machines, vixen and dasher, which share the same file
>>> structure. Vixen is an nfs server and dasher is an nfs
>>> client. I have just installed openmpi 1.4.3 on dasher,
>>> which is the same version I have on vixen.
>>>
>>> I have a file app.ac3, which looks like:
>>> [tsakai_at_vixen Rmpi]$ cat app.ac3
>>> -H dasher.egcrc.org -np 1 hostname
>>> -H dasher.egcrc.org -np 1 hostname
>>> -H vixen.egcrc.org -np 1 hostname
>>> -H vixen.egcrc.org -np 1 hostname
>>> [tsakai_at_vixen Rmpi]$
>>>
>>> Vixen can run this without any problem:
>>> [tsakai_at_vixen Rmpi]$ mpirun -app app.ac3
>>> vixen.egcrc.org
>>> vixen.egcrc.org
>>> dasher.egcrc.org
>>> dasher.egcrc.org
>>> [tsakai_at_vixen Rmpi]$
>>>
>>> But I can't run this very command from dasher:
>>> [tsakai_at_vixen Rmpi]$
>>> [tsakai_at_vixen Rmpi]$ ssh dasher
>>> Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
>>> [tsakai_at_dasher ~]$
>>> [tsakai_at_dasher ~]$ cd Notes/R/parallel/Rmpi/
>>> [tsakai_at_dasher Rmpi]$
>>> [tsakai_at_dasher Rmpi]$ mpirun -app app.ac3
>>> mpirun: killing job...
>> a) can you ssh from dasher to vixen?
>>
>> b) firewall on vixen?
>>
>> -- Reuti
>>
>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --------------------------------------------------------------------------
>>> vixen.egcrc.org - daemon did not report back when launched
>>> [tsakai_at_dasher Rmpi]$
>>>
>>> After I issue the mpirun command, it hangs and I have to Control-C out
>>> of it, at which point it generates the "mpirun: killing job..." line
>>> and everything below it.
>>>
>>> A strange thing is that dasher has no problem executing the same
>>> thing via ssh:
>>> [tsakai_at_dasher Rmpi]$ ssh vixen.egcrc.org hostname
>>> vixen.egcrc.org
>>> [tsakai_at_dasher Rmpi]$
>>>
>>> In fact, dasher can run it via mpirun so long as no foreign machine
>>> is present in the app file. I.e.,
>>> [tsakai_at_dasher Rmpi]$ cat app.ac4
>>> -H dasher.egcrc.org -np 1 hostname
>>> -H dasher.egcrc.org -np 1 hostname
>>> # -H vixen.egcrc.org -np 1 hostname
>>> # -H vixen.egcrc.org -np 1 hostname
>>> [tsakai_at_dasher Rmpi]$
>>> [tsakai_at_dasher Rmpi]$ mpirun -app app.ac4
>>> dasher.egcrc.org
>>> dasher.egcrc.org
>>> [tsakai_at_dasher Rmpi]$
>>>
>>> Can you please tell me why I can go one way (from vixen to dasher)
>>> and not the other way (dasher to vixen)?
>>>
>>> Thank you.
>>>
>>> Tena
>>>
>>>
>>> On 2/12/11 9:42 PM, "Gustavo Correa" <gus_at_[hidden]> wrote:
>>>
>>>> Hi Tena
>>>>
>>>> Thank you for taking the time to explain the details of
>>>> the EC2 procedure.
>>>>
>>>> I am afraid everything in my bag of tricks was used.
>>>> As Ralph and Jeff suggested, this seems to be a very specific
>>>> problem with EC2.
>>>>
>>>> The difference in behavior when you run as root vs. when you run as
>>>> Tena suggests that EC2 places some restriction on regular users that
>>>> isn't present on ordinary machines (Linux or otherwise), I guess.
>>>> This may be yet another 'stone to turn', as you like to say.
>>>> It also suggests that there is nothing wrong in principle with your
>>>> OpenMPI setup or with your program; otherwise root would not be able
>>>> to run it.
>>>>
>>>> Besides Ralph's suggestion of trying the EC2 mailing list archive,
>>>> I wonder if EC2 has any type of user support where you could ask for help.
>>>> After all, it is a paid service, isn't it?
>>>> (OpenMPI is not paid for and has great customer service, doesn't it? :) )
>>>> You have a well documented case to present, and the very peculiar fact
>>>> that the program fails for normal users but runs for root.
>>>> This should help the EC2 support start looking for a solution.
>>>>
>>>> I am running out of suggestions of what you could try on your own.
>>>> But let me try:
>>>>
>>>> 1) You may try to reduce the problem to its lowest common denominator,
>>>> perhaps by trying to run non-R-based MPI programs on EC2, maybe the
>>>> hello_c.c, ring_c.c, and connectivity_c.c programs in the OpenMPI
>>>> examples directory.
>>>> This would be to avoid the extra layer of complexity introduced by R.
>>>> Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
>>>> I.e. go in a progression of increasing complexity and see where you hit
>>>> the wall.
>>>> This may shed some light on what is going on.
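
A minimal sketch of such a progression (assuming the examples/ directory
from the Open MPI 1.4.3 tarball is still around; the host names are
placeholders):

mpirun -np 2 -H nodeA,nodeB hostname      # no MPI calls at all, just launch and I/O
mpicc examples/hello_c.c -o hello_c
mpirun -np 2 -H nodeA,nodeB ./hello_c     # then ring_c.c and connectivity_c.c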
>>>>
>>>> I don't know if this suggestion may really help, though.
>>>> It is not clear to me where the thing fails, whether it is during program
>>>> execution or while mpiexec is setting up the environment for the program
>>>> to run.
>>>> If it is very early in the process, before the program starts, my
>>>> suggestion won't work.
>>>> Jeff and Ralph, who know OpenMPI inside out, may have better advice in
>>>> this regard.
>>>>
>>>> 2) Another thing would be to try to run R on EC2 in serial mode, without
>>>> mpiexec, interactively or via script, to see which one EC2 doesn't like:
>>>> R or OpenMPI (or maybe it's both).
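
For instance, a hedged sketch using the fib.R script already on the nodes
(the remote host name is a placeholder):

Rscript /home/tsakai/fib.R 5                              # serial, local, no MPI
ssh machine-B-private-dns Rscript /home/tsakai/fib.R 5    # serial, remote, still no MPI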
>>>>
>>>> Gus Correa
>>>>
>>>> On Feb 11, 2011, at 9:54 PM, Tena Sakai wrote:
>>>>
>>>>> Hi Gus,
>>>>>
>>>>> Thank you for your tips.
>>>>>
>>>>> I didn't find any smoking gun or anything comes close.
>>>>> Here's the upshot:
>>>>>
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a
>>>>> core file size (blocks, -c) 0
>>>>> data seg size (kbytes, -d) unlimited
>>>>> scheduling priority (-e) 0
>>>>> file size (blocks, -f) unlimited
>>>>> pending signals (-i) 61504
>>>>> max locked memory (kbytes, -l) 32
>>>>> max memory size (kbytes, -m) unlimited
>>>>> open files (-n) 1024
>>>>> pipe size (512 bytes, -p) 8
>>>>> POSIX message queues (bytes, -q) 819200
>>>>> real-time priority (-r) 0
>>>>> stack size (kbytes, -s) 8192
>>>>> cpu time (seconds, -t) unlimited
>>>>> max user processes (-u) 61504
>>>>> virtual memory (kbytes, -v) unlimited
>>>>> file locks (-x) unlimited
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>>>> bash-3.2#
>>>>> bash-3.2# ulimit -a
>>>>> core file size (blocks, -c) 0
>>>>> data seg size (kbytes, -d) unlimited
>>>>> scheduling priority (-e) 0
>>>>> file size (blocks, -f) unlimited
>>>>> pending signals (-i) 61504
>>>>> max locked memory (kbytes, -l) 32
>>>>> max memory size (kbytes, -m) unlimited
>>>>> open files (-n) 1024
>>>>> pipe size (512 bytes, -p) 8
>>>>> POSIX message queues (bytes, -q) 819200
>>>>> real-time priority (-r) 0
>>>>> stack size (kbytes, -s) 8192
>>>>> cpu time (seconds, -t) unlimited
>>>>> max user processes (-u) unlimited
>>>>> virtual memory (kbytes, -v) unlimited
>>>>> file locks (-x) unlimited
>>>>> bash-3.2#
>>>>> bash-3.2#
>>>>> bash-3.2# ulimit -a > root_ulimit-a
>>>>> bash-3.2# exit
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>>>>> 14c14
>>>>> < max user processes (-u) unlimited
>>>>> ---
>>>>>> max user processes (-u) 61504
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr
>>>>> /proc/sys/fs/file-max
>>>>> 480 0 762674
>>>>> 762674
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>>>> bash-3.2#
>>>>> bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>>> 512 0 762674
>>>>> 762674
>>>>> bash-3.2# exit
>>>>> exit
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>>>>> -bash: sysctl: command not found
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ /sbin/!!
>>>>> /sbin/sysctl -a |grep fs.file-max
>>>>> error: permission denied on key 'kernel.cad_pid'
>>>>> error: permission denied on key 'kernel.cap-bound'
>>>>> fs.file-max = 762674
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
>>>>> fs.file-max = 762674
>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>
>>>>> I see a bit of difference between root and tsakai, but I cannot
>>>>> believe such a small difference results in the kind of catastrophic
>>>>> failure I have reported. Would you agree with me?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tena
>>>>>
>>>>> On 2/11/11 6:06 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>>>
>>>>>> Hi Tena
>>>>>>
>>>>>> Please read one answer inline.
>>>>>>
>>>>>> Tena Sakai wrote:
>>>>>>> Hi Jeff,
>>>>>>> Hi Gus,
>>>>>>>
>>>>>>> Thanks for your replies.
>>>>>>>
>>>>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>>>>>>> as identical to that of root. In that setting I reproduced the
>>>>>>> same result as before: root can run mpirun correctly and tsakai
>>>>>>> cannot.
>>>>>>>
>>>>>>> I have also checked out permission on /tmp directory. tsakai has
>>>>>>> no problem creating files under /tmp.
>>>>>>>
>>>>>>> I am trying to come up with a strategy to show that each and every
>>>>>>> program in the PATH has "world" executable permission. It is a
>>>>>>> stone to turn over, but I am not holding my breath.
>>>>>>>
>>>>>>>> ... you are running out of file descriptors. Are file descriptors
>>>>>>>> limited on a per-process basis, perchance?
>>>>>>> I have never heard of such a restriction on Amazon EC2. There
>>>>>>> are folks who keep instances running for a long, long time. Whereas
>>>>>>> in my case, I launch 2 instances, check things out, and then turn
>>>>>>> the instances off. (Given that the state of California has huge
>>>>>>> debts, our funding is very tight.) So, I really doubt that's the
>>>>>>> case. I have run mpirun unsuccessfully as user tsakai and immediately
>>>>>>> after successfully as root. Still, I would be happy if you could tell
>>>>>>> me a way to tell the number of file descriptors used or remaining.
>>>>>>>
>>>>>>> Your mention of file descriptors made me think of something under
>>>>>>> /dev. But I don't know exactly what I am fishing for. Do you have
>>>>>>> some suggestions?
>>>>>>>
>>>>>> 1) If the environment has anything to do with Linux,
>>>>>> check:
>>>>>>
>>>>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>>>>
>>>>>>
>>>>>> or
>>>>>>
>>>>>> sysctl -a |grep fs.file-max
>>>>>>
>>>>>> This max can be set (fs.file-max=whatever_is_reasonable)
>>>>>> in /etc/sysctl.conf
>>>>>>
>>>>>> See 'man sysctl' and 'man sysctl.conf'
>>>>>>
>>>>>> 2) Another possible source of limits.
>>>>>>
>>>>>> Check "ulimit -a" (bash) or "limit" (tcsh).
>>>>>>
>>>>>> If you need to change them, look at:
>>>>>>
>>>>>> /etc/security/limits.conf
>>>>>>
>>>>>> (See also 'man limits.conf')
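
For example, raising the per-user open-file limit in limits.conf might look
like this (a hedged sketch; the values are only illustrative):

tsakai   soft   nofile   4096
tsakai   hard   nofile   8192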
>>>>>>
>>>>>> **
>>>>>>
>>>>>> Since "root can but Tena cannot",
>>>>>> I would check 2) first,
>>>>>> as they are the 'per user/per group' limits,
>>>>>> whereas 1) is kernel/system-wise.
>>>>>>
>>>>>> I hope this helps,
>>>>>> Gus Correa
>>>>>>
>>>>>> PS - I know you are a wise and careful programmer,
>>>>>> but here we had cases of programs that would
>>>>>> fail because of too many files that were open and never closed,
>>>>>> eventually exceeding the max available/permissible.
>>>>>> So, it does happen.
>>>>>>
>>>>>>> I wish I could reproduce this (weird) behavior on a different
>>>>>>> set of machines. I certainly cannot in my local environment. Sigh!
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tena
>>>>>>>
>>>>>>>
>>>>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> It is concerning if the pipe system call fails - I can't think of why
>>>>>>>> that would happen. That's not usually a permissions issue but rather a
>>>>>>>> deeper indication that something is either seriously wrong on your
>>>>>>>> system or you are running out of file descriptors. Are file descriptors
>>>>>>>> limited on a per-process basis, perchance?
>>>>>>>>
>>>>>>>> Sent from my PDA. No type good.
>>>>>>>>
>>>>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <gus_at_[hidden]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Tena
>>>>>>>>>
>>>>>>>>> Since root can but you can't,
>>>>>>>>> is it a directory permission problem perhaps?
>>>>>>>>> Check the execution directory permissions (on both machines,
>>>>>>>>> if this is not an NFS-mounted dir).
>>>>>>>>> I am not sure, but IIRR OpenMPI also uses /tmp for
>>>>>>>>> under-the-hood stuff, worth checking permissions there also.
>>>>>>>>> Just a naive guess.
>>>>>>>>>
>>>>>>>>> Congrats for all the progress with the cloudy MPI!
>>>>>>>>>
>>>>>>>>> Gus Correa
>>>>>>>>>
>>>>>>>>> Tena Sakai wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> I have made a bit more progress. I think I can say the ssh
>>>>>>>>>> authentication problem is behind me now. I am still having a problem
>>>>>>>>>> running mpirun, but the latest discovery, which I can reproduce, is
>>>>>>>>>> that I can run mpirun as root. Here's the session log:
>>>>>>>>>> [tsakai_at_vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>>>>>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll
>>>>>>>>>> total 8
>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll .ssh
>>>>>>>>>> total 16
>>>>>>>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys
>>>>>>>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config
>>>>>>>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>>>>>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>>>>>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # I am on machine B
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ hostname
>>>>>>>>>> ip-10-100-243-195
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ ll
>>>>>>>>>> total 8
>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ cat app.ac
>>>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # go back to machine A
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ exit
>>>>>>>>>> logout
>>>>>>>>>> Connection to ip-10-100-243-195.ec2.internal closed.
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ hostname
>>>>>>>>>> ip-10-195-198-31
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # Execute mpirun
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun was unable to launch the specified application as it encountered
>>>>>>>>>> an error:
>>>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>>>> Node: ip-10-195-198-31
>>>>>>>>>> while attempting to start process rank 0.
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it as root
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ sudo su
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# pwd
>>>>>>>>>> /home/tsakai
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# ls -l /root/.ssh/config
>>>>>>>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# cat /root/.ssh/config
>>>>>>>>>> Host *
>>>>>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>> BatchMode yes
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# pwd
>>>>>>>>>> /home/tsakai
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# ls -l
>>>>>>>>>> total 8
>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# # now is the time for mpirun
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# mpirun --app ./app.ac
>>>>>>>>>> 13 ip-10-100-243-195
>>>>>>>>>> 21 ip-10-100-243-195
>>>>>>>>>> 5 ip-10-195-198-31
>>>>>>>>>> 8 ip-10-195-198-31
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# # It works (being root)!
>>>>>>>>>> bash-3.2#
>>>>>>>>>> bash-3.2# exit
>>>>>>>>>> exit
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun --app app.ac
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun was unable to launch the specified application as it encountered
>>>>>>>>>> an error:
>>>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>>>> Node: ip-10-195-198-31
>>>>>>>>>> while attempting to start process rank 0.
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # I don't get it.
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ exit
>>>>>>>>>> logout
>>>>>>>>>> [tsakai_at_vixen ec2]$
>>>>>>>>>> So, why does it say "pipe function call failed when setting up
>>>>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31"?
>>>>>>>>>> The node it is referring to is not the remote machine. It is
>>>>>>>>>> what I call machine A. I first thought maybe this is a problem
>>>>>>>>>> with the PATH variable. But I don't think so. I compared root's
>>>>>>>>>> PATH to that of tsakai's and made them identical and retried.
>>>>>>>>>> I got the same behavior.
>>>>>>>>>> If you could enlighten me why this is happening, I would really
>>>>>>>>>> appreciate it.
>>>>>>>>>> Thank you.
>>>>>>>>>> Tena
>>>>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>>> Hi jeff,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the firewall tip. I tried it while allowing all tcp traffic
>>>>>>>>>>> and got an interesting and perplexing result. Here's what's interesting
>>>>>>>>>>> (BTW, I got rid of "LogLevel DEBUG3" from ./ssh/config on this run):
>>>>>>>>>>>
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>>>> Host key verification failed.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> A daemon (pid 2743) died unexpectedly with status 255 while attempting
>>>>>>>>>>> to launch so we are aborting.
>>>>>>>>>>>
>>>>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>>>>
>>>>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>>>>>>> the location of the shared libraries on the remote nodes and this will
>>>>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>> that caused that situation.
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to
>>>>>>>>>>> /usr/local/lib
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I better to this on machine B as well
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>>>>>>>>> Warning: Identity file tsakai not accessible: No such file or
>>>>>>>>>>> directory.
>>>>>>>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>>>>>>>>> LD_LIBRARY_PATH=/usr/local/lib
>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ # OK, now go bak to machine A
>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ exit
>>>>>>>>>>> logout
>>>>>>>>>>> Connection to ip-10-195-171-159 closed.
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ hostname
>>>>>>>>>>> ip-10-203-21-132
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # try mpirun again
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>>>> Host key verification failed.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> A daemon (pid 2789) died unexpectedly with status 255 while attempting
>>>>>>>>>>> to launch so we are aborting.
>>>>>>>>>>>
>>>>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>>>>
>>>>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>>>>>>> the location of the shared libraries on the remote nodes and this will
>>>>>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>> that caused that situation.
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I thought openmpi library was in
>>>>>>>>>>> /usr/local/lib...
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>>>>>>>>> total 16604
>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so ->
>>>>>>>>>>> libfuse.so.2.8.5
>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 ->
>>>>>>>>>>> libfuse.so.2.8.5
>>>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so ->
>>>>>>>>>>> libmca_common_sm.so.1.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 ->
>>>>>>>>>>> libmca_common_sm.so.1.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so ->
>>>>>>>>>>> libmpi.so.0.0.2
>>>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 ->
>>>>>>>>>>> libmpi.so.0.0.2
>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so ->
>>>>>>>>>>> libmpi_cxx.so.0.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 ->
>>>>>>>>>>> libmpi_cxx.so.0.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so ->
>>>>>>>>>>> libmpi_f77.so.0.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 ->
>>>>>>>>>>> libmpi_f77.so.0.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so ->
>>>>>>>>>>> libmpi_f90.so.0.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 ->
>>>>>>>>>>> libmpi_f90.so.0.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so ->
>>>>>>>>>>> libopen-pal.so.0.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 ->
>>>>>>>>>>> libopen-pal.so.0.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so ->
>>>>>>>>>>> libopen-rte.so.0.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 ->
>>>>>>>>>>> libopen-rte.so.0.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so ->
>>>>>>>>>>> libopenmpi_malloc.so.0.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 ->
>>>>>>>>>>> libopenmpi_malloc.so.0.0.0
>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so ->
>>>>>>>>>>> libulockmgr.so.1.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 ->
>>>>>>>>>>> libulockmgr.so.1.0.1
>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so ->
>>>>>>>>>>> libxml2.so.2.7.2
>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 ->
>>>>>>>>>>> libxml2.so.2.7.2
>>>>>>>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>
>>>>>>>>>>> Do you know why it's complaining about shared libraries?
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Tena
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like you
>>>>>>>>>>>> might have firewall issues.
>>>>>>>>>>>>
>>>>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back to
>>>>>>>>>>>> mpirun for various command and control reasons. If it is blocked from
>>>>>>>>>>>> doing so by a firewall, Open MPI won't run. In general, you can either
>>>>>>>>>>>> disable your firewall or you can set up a trust relationship for TCP
>>>>>>>>>>>> connections within your cluster.
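
A hedged sketch of such a trust rule, assuming iptables is the active
firewall and 10.0.0.0/8 covers the cluster's private addresses:

sudo /sbin/iptables -I INPUT -p tcp -s 10.0.0.0/8 -j ACCEPT
sudo /sbin/service iptables save   # RHEL/CentOS-style persistence, if desired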
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Reuti,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and complete
>>>>>>>>>>>>> session is captured in the attached file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I did is very similar to what I have done before: verify
>>>>>>>>>>>>> that ssh works and then run the mpirun command. In my somewhat
>>>>>>>>>>>>> lengthy session log, there are two responses from "LogLevel DEBUG3,"
>>>>>>>>>>>>> first from an scp invocation and then from the mpirun invocation.
>>>>>>>>>>>>> They both say
>>>>>>>>>>>>> debug1: Authentication succeeded (publickey).
>>>>>>>>>>>>>
>>>>>>>>>>>>> From mpirun invocation, I see a line:
>>>>>>>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>>>>>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca
>>>>>>>>>>>>> orte_ess_num_procs
>>>>>>>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>>>>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>>>>>>>>> After that it hung and I Control-C'd out of it, which
>>>>>>>>>>>>> gave me more lines. But the lines after
>>>>>>>>>>>>> debug1: Sending command: orted bla bla bla
>>>>>>>>>>>>> don't look good to me. But, in truth, I have no idea what they
>>>>>>>>>>>>> mean.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>>>>>>>
>>>>>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>> No, my environment is entirely linux. The path to my home
>>>>>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>>>>>>> even though it is an nfs mount from vixen (which is known to
>>>>>>>>>>>>> itself as /home/tsakai). For historical reasons, I have
>>>>>>>>>>>>> chosen to give a symbolic link named /Users to vixen's /home,
>>>>>>>>>>>>> so that I can use a consistent path for both vixen and blitzen.
>>>>>>>>>>>>> okay. Sometimes the protection of the home directory must be
>>>>>>>>>>>>> adjusted too, but as you can do it from the command line this
>>>>>>>>>>>>> shouldn't be an issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>>>>>>> but to address from one Amazon cluster node to the other I
>>>>>>>>>>>>> use the nodes' private dns names: domU-12-31-39-07-35-21 and
>>>>>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private dns names
>>>>>>>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>>>>>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>>>>>>> Amazon node B back to A. (Please see my initial post. There
>>>>>>>>>>>>> is a session dialogue for this.) They all work without an
>>>>>>>>>>>>> authentication dialogue, except a brief initial dialogue:
>>>>>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>>>>>>> can't be established.
>>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)?
>>>>>>>>>>>>> to which I say "yes."
>>>>>>>>>>>>> But I am unclear about what you mean by "hostbased authentication".
>>>>>>>>>>>>> Doesn't that mean with a password? If so, it is not an option.
>>>>>>>>>>>>> No. It's convenient inside a private cluster as it won't fill each
>>>>>>>>>>>>> user's known_hosts file and you don't need to create any ssh-keys.
>>>>>>>>>>>>> But when the hostname changes every time it might also create new
>>>>>>>>>>>>> hostkeys. It uses hostkeys (private and public); this way it works
>>>>>>>>>>>>> for all users. Just for reference:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> You could look into it later.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ==
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>>>>>>>>> `ssh domU-12-31-39-06-74-E2 ls`. Is this working too?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - What about putting:
>>>>>>>>>>>>>
>>>>>>>>>>>>> LogLevel DEBUG3
>>>>>>>>>>>>>
>>>>>>>>>>>>> in your ~/.ssh/config. Maybe we can see what he's trying to
>>>>>>>>>>>>> negotiate before it fails in verbose mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> your local machine is Linux-like, but the execution hosts are Macs?
>>>>>>>>>>>>> I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>>
>>>>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks
>>>>>>>>>>>>> like:
>>>>>>>>>>>>> # machine A
>>>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>>>>>> This is just an abbreviation or nickname above. To use the specified
>>>>>>>>>>>>> settings, it's necessary to specify exactly this name. When the
>>>>>>>>>>>>> settings are the same anyway for all machines, you can use:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Host *
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It would
>>>>>>>>>>>>> also be an option to use hostbased authentication, which will avoid
>>>>>>>>>>>>> setting any known_hosts file or passphraseless ssh-keys for each user.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> HostName domU-12-31-39-07-35-21
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> # machine B
>>>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>>>> HostName domU-12-31-39-06-74-E2
>>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>>>>
>>>>>>>>>>>>> It hangs. I Control-C out of it and I get:
>>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>>>>>>>> the "orte-clean" tool for assistance.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report
>>>>>>>>>>>>> back when launched
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I making progress?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does this mean I am past authentication and something else is the
>>>>>>>>>>>>> problem?
>>>>>>>>>>>>> Does someone have an example .ssh/config file I can look at? There
>>>>>>>>>>>>> are so many keyword-argument pairs for this config file and I would
>>>>>>>>>>>>> like to look at some very basic one that works.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>>> tsakai_at_[hidden]
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ cat app.ac1
>>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> The program I run is
>>>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>>>> where x is [5..8]. The machines vixen and blitzen each run 2 runs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ cat fib.R
>>>>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>>>
>>>>>>>>>>>>> fib <- function( n ) {
>>>>>>>>>>>>> a <- 0
>>>>>>>>>>>>> b <- 1
>>>>>>>>>>>>> for ( i in 1:n ) {
>>>>>>>>>>>>> t <- b
>>>>>>>>>>>>> b <- a
>>>>>>>>>>>>> a <- a + t
>>>>>>>>>>>>> }
>>>>>>>>>>>>> a
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> arg <- commandArgs( TRUE )
>>>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>>> cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>>>
>>>>>>>>>>>>> It reads an argument from the command line and produces the fibonacci
>>>>>>>>>>>>> number that corresponds to that index, followed by the machine name.
>>>>>>>>>>>>> Pretty simple stuff.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's the run output:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>>> 5 vixen.egcrc.org
>>>>>>>>>>>>> 8 vixen.egcrc.org
>>>>>>>>>>>>> 13 blitzen.egcrc.org
>>>>>>>>>>>>> 21 blitzen.egcrc.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which is exactly what I expect. So far so good.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I want to run the same thing on the cloud. I launch 2 instances
>>>>>>>>>>>>> of the same virtual machine, which I get to by:
>>>>>>>>>>>>> [tsakai_at_vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B
>>>>>>>>>>>>> without
>>>>>>>>>>>>> password authentication,
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine
>>>>>>>>>>>>> A
>>>>>>>>>>>>> without using password
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)'
>>>>>>>>>>>>> can't
>>>>>>>>>>>>> be established.
>>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the
>>>>>>>>>>>>> list
>>>>>>>>>>>>> of
>>>>>>>>>>>>> known hosts.
>>>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>>> logout
>>>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>>> logout
>>>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see, neither machine uses a password for authentication;
>>>>>>>>>>>>> it uses public/private key pairs. There is no problem (that I can see)
>>>>>>>>>>>>> with ssh invocation from one machine to the other. This is so because
>>>>>>>>>>>>> I have a copy of the public key and a copy of the private key on each
>>>>>>>>>>>>> instance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The app.ac file is identical, except for the node names:
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>>> Permission denied, please try again.
>>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>>>>>>>> that caused that situation.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>>>
>>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mpirun (or somebody else?) asks me for a password, which I don't have.
>>>>>>>>>>>>> I end up typing control-C.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's my question:
>>>>>>>>>>>>> How can I get past authentication by mpirun where there is no
>>>>>>>>>>>>> password?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>>> tsakai_at_[hidden]