
Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Tena Sakai (tsakai_at_[hidden])
Date: 2011-02-14 16:10:13


Hi Gus,

Thank you for your response.

I have verified that
 1) /etc/hosts files on both machines vixen and dasher are identical
 2) both machines have nothing but comments in hosts.allow and hosts.deny
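For what it's worth, a quick way to make comparison 1) directly from vixen
is a remote diff (no output means the two files match):
  diff <(ssh dasher cat /etc/hosts) /etc/hosts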
Regarding the firewall, they are different.
On vixen, this is how it looks:
  [root_at_vixen ec2]# cat /etc/sysconfig/iptables
  cat: /etc/sysconfig/iptables: No such file or directory
  [root_at_vixen ec2]#
  [root_at_vixen ec2]# /sbin/iptables --list
  Chain INPUT (policy ACCEPT)
  target prot opt source destination

  Chain FORWARD (policy ACCEPT)
  target prot opt source destination

  Chain OUTPUT (policy ACCEPT)
  target prot opt source destination
  [root_at_vixen ec2]#

On dasher:
  [tsakai_at_dasher Rmpi]$ sudo cat /etc/sysconfig/iptables
  # Firewall configuration written by system-config-securitylevel
  # Manual customization of this file is not recommended.
  *filter
  :INPUT ACCEPT [0:0]
  :FORWARD ACCEPT [0:0]
  :OUTPUT ACCEPT [0:0]
  :RH-Firewall-1-INPUT - [0:0]
  -A INPUT -j RH-Firewall-1-INPUT
  -A FORWARD -j RH-Firewall-1-INPUT
  -A RH-Firewall-1-INPUT -i lo -j ACCEPT
  -A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
  -A RH-Firewall-1-INPUT -p 50 -j ACCEPT
  -A RH-Firewall-1-INPUT -p 51 -j ACCEPT
  -A RH-Firewall-1-INPUT -p udp --dport 5353 -d 224.0.0.251 -j ACCEPT
  -A RH-Firewall-1-INPUT -p udp -m udp --dport 631 -j ACCEPT
  -A RH-Firewall-1-INPUT -p tcp -m tcp --dport 631 -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
  -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
  COMMIT
  [tsakai_at_dasher Rmpi]$
  [tsakai_at_dasher Rmpi]$ sudo /sbin/iptables --list
  [sudo] password for tsakai:
  Chain INPUT (policy ACCEPT)
  target prot opt source destination
  RH-Firewall-1-INPUT all -- anywhere anywhere

  Chain FORWARD (policy ACCEPT)
  target prot opt source destination
  RH-Firewall-1-INPUT all -- anywhere anywhere

  Chain OUTPUT (policy ACCEPT)
  target prot opt source destination

  Chain RH-Firewall-1-INPUT (2 references)
  target prot opt source destination
  ACCEPT all -- anywhere anywhere
  ACCEPT icmp -- anywhere anywhere icmp any
  ACCEPT esp -- anywhere anywhere
  ACCEPT ah -- anywhere anywhere
  ACCEPT udp -- anywhere 224.0.0.251 udp dpt:mdns
  ACCEPT udp -- anywhere anywhere udp dpt:ipp
  ACCEPT tcp -- anywhere anywhere tcp dpt:ipp
  ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
  ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
  ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:http
  REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
  [tsakai_at_dasher Rmpi]$

I don't understand what they mean. Can you see any clue as to
why vixen can, and dasher cannot, run mpirun with this app file:
  -H dasher.egcrc.org -np 1 hostname
  -H dasher.egcrc.org -np 1 hostname
  -H vixen.egcrc.org -np 1 hostname
  -H vixen.egcrc.org -np 1 hostname
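
One difference does stand out to me, even if I can't interpret the rest:
vixen's chains accept everything, while dasher's RH-Firewall-1-INPUT chain
accepts only ssh, http, ipp, mdns, and already-established connections, and
REJECTs everything else. Jeff mentioned earlier that orted opens a TCP
socket back to mpirun on an arbitrary port, so I wonder whether that REJECT
rule is what keeps vixen's daemon from reporting back to dasher. Would
something like this, run on dasher, be a reasonable test? (Just a sketch;
192.168.1.10 stands in for vixen's actual address.)

  sudo /sbin/iptables -I RH-Firewall-1-INPUT 1 -s 192.168.1.10 -j ACCEPT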

Many thanks.

Tena

On 2/14/11 11:15 AM, "Gus Correa" <gus_at_[hidden]> wrote:

> Tena Sakai wrote:
>> Hi Reuti,
>>
>>> a) can you ssh from dasher to vixen?
>> Yes, no problem.
>> [tsakai_at_dasher Rmpi]$
>> [tsakai_at_dasher Rmpi]$ hostname
>> dasher.egcrc.org
>> [tsakai_at_dasher Rmpi]$
>> [tsakai_at_dasher Rmpi]$ ssh vixen
>> Last login: Mon Feb 14 10:39:20 2011 from dasher.egcrc.org
>> [tsakai_at_vixen ~]$
>> [tsakai_at_vixen ~]$ hostname
>> vixen.egcrc.org
>> [tsakai_at_vixen ~]$
>>
>>> b) firewall on vixen?
>> There is no firewall on vixen that I know of, but I don't
>> know how I can definitively show it one way or the other.
>> Can you please suggest how I can do this?
>>
>> Regards,
>>
>> Tena
>>
>>
>
> Hi Tena
>
> Besides Reuti's suggestions:
>
> Check the consistency of /etc/hosts on both machines.
> Check if there are restrictions on /etc/hosts.allow and
> /etc/hosts.deny on both machines.
> Check if both the MPI directories and your home/work directory
> are mounted/available on both machines.
> (We may have been through this checklist before, sorry if I forgot.)
>
> Firewall info (not very friendly syntax ...):
>
> iptables --list
>
> or maybe better:
>
> cat /etc/sysconfig/iptables
>
> I hope it helps,
> Gus Correa
>
>> On 2/14/11 4:38 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>
>>> Hi,
>>>
>>> Am 14.02.2011 um 04:54 schrieb Tena Sakai:
>>>
>>>> I have digressed and started downward descent...
>>>>
>>>> I was trying to make a simple and clear case. Everything
>>>> I write in this very mail is about local machines. There
>>>> are no virtual machines involved. I am talking about two
>>>> machines, vixen and dasher, which share the same file
>>>> structure. Vixen is an NFS server and dasher is an NFS
>>>> client. I have just installed Open MPI 1.4.3 on dasher,
>>>> which is the same version I have on vixen.
>>>>
>>>> I have a file app.ac3, which looks like:
>>>> [tsakai_at_vixen Rmpi]$ cat app.ac3
>>>> -H dasher.egcrc.org -np 1 hostname
>>>> -H dasher.egcrc.org -np 1 hostname
>>>> -H vixen.egcrc.org -np 1 hostname
>>>> -H vixen.egcrc.org -np 1 hostname
>>>> [tsakai_at_vixen Rmpi]$
>>>>
>>>> Vixen can run this without any problem:
>>>> [tsakai_at_vixen Rmpi]$ mpirun -app app.ac3
>>>> vixen.egcrc.org
>>>> vixen.egcrc.org
>>>> dasher.egcrc.org
>>>> dasher.egcrc.org
>>>> [tsakai_at_vixen Rmpi]$
>>>>
>>>> But I can't run this very command from dasher:
>>>> [tsakai_at_vixen Rmpi]$
>>>> [tsakai_at_vixen Rmpi]$ ssh dasher
>>>> Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
>>>> [tsakai_at_dasher ~]$
>>>> [tsakai_at_dasher ~]$ cd Notes/R/parallel/Rmpi/
>>>> [tsakai_at_dasher Rmpi]$
>>>> [tsakai_at_dasher Rmpi]$ mpirun -app app.ac3
>>>> mpirun: killing job...
>>> a) can you ssh from dasher to vixen?
>>>
>>> b) firewall on vixen?
>>>
>>> -- Reuti
>>>
>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>> vixen.egcrc.org - daemon did not report back when launched
>>>> [tsakai_at_dasher Rmpi]$
>>>>
>>>> After I issued the mpirun command, it hung and I had to Control-C out
>>>> of it, at which point it generated the "mpirun: killing job..." line
>>>> and everything below it.
>>>>
>>>> A strange thing is that dasher has no problem executing the same
>>>> thing via ssh:
>>>> [tsakai_at_dasher Rmpi]$ ssh vixen.egcrc.org hostname
>>>> vixen.egcrc.org
>>>> [tsakai_at_dasher Rmpi]$
>>>>
>>>> In fact, dasher can run it via mpirun so long as no foreign machine
>>>> is present in the app file, i.e.:
>>>> [tsakai_at_dasher Rmpi]$ cat app.ac4
>>>> -H dasher.egcrc.org -np 1 hostname
>>>> -H dasher.egcrc.org -np 1 hostname
>>>> # -H vixen.egcrc.org -np 1 hostname
>>>> # -H vixen.egcrc.org -np 1 hostname
>>>> [tsakai_at_dasher Rmpi]$
>>>> [tsakai_at_dasher Rmpi]$ mpirun -app app.ac4
>>>> dasher.egcrc.org
>>>> dasher.egcrc.org
>>>> [tsakai_at_dasher Rmpi]$
>>>>
>>>> Can you please tell me why I can go one way (from vixen to dasher)
>>>> and not the other way (dasher to vixen)?
>>>>
>>>> Thank you.
>>>>
>>>> Tena
>>>>
>>>>
>>>> On 2/12/11 9:42 PM, "Gustavo Correa" <gus_at_[hidden]> wrote:
>>>>
>>>>> Hi Tena
>>>>>
>>>>> Thank you for taking the time to explain the details of
>>>>> the EC2 procedure.
>>>>>
>>>>> I am afraid everything in my bag of tricks was used.
>>>>> As Ralph and Jeff suggested, this seems to be a very specific
>>>>> problem with EC2.
>>>>>
>>>>> The difference in behavior when you run as root vs. when you
>>>>> run as Tena tells me that there is some usage restriction on regular
>>>>> users in EC2 that isn't present on common machines (Linux or other), I guess.
>>>>> This may be yet another 'stone to turn', as you like to say.
>>>>> It also suggests that there is nothing wrong in principle with your
>>>>> Open MPI setup or with your program; otherwise root would not be able
>>>>> to run it.
>>>>>
>>>>> Besides Ralph suggestion of trying the EC2 mailing list archive,
>>>>> I wonder if EC2 has any type of user support where you could ask
>>>>> for help.
>>>>> After all, it is a paid service, isn't it?
>>>>> (OpenMPI is not paid and has a great customer service, doesn't it? :) )
>>>>> You have a well documented case to present,
>>>>> and the very peculiar fact that the program fails for normal users but
>>>>> runs for root.
>>>>> This should help the EC2 support to start looking for a solution.
>>>>>
>>>>> I am running out of suggestions of what you could try on your own.
>>>>> But let me try:
>>>>>
>>>>> 1) You may try to reduce the problem to its lowest common denominator,
>>>>> perhaps by trying to run non-R-based MPI programs on EC2, maybe the
>>>>> hello_c.c, ring_c.c, and connectivity_c.c programs in the Open MPI
>>>>> examples directory.
>>>>> This would be to avoid the extra layer of complexity introduced by R.
>>>>> Even simpler would be to run 'hostname' with mpiexec
>>>>> (mpiexec -np 2 hostname).
>>>>> I.e., go in a progression of increasing complexity and see where you
>>>>> hit the wall.
>>>>> This may shed some light on what is going on.
>>>>>
>>>>> I don't know if this suggestion may really help, though.
>>>>> It is not clear to me where the thing fails, whether it is during
>>>>> program execution or while mpiexec is setting up the environment for
>>>>> the program to run.
>>>>> If it is very early in the process, before the program starts, my
>>>>> suggestion won't work.
>>>>> Jeff and Ralph, who know Open MPI inside out, may have better advice
>>>>> in this regard.
>>>>>
>>>>> 2) Another thing would be to try to run R on EC2 in serial mode,
>>>>> without mpiexec, interactively or via script, to see which one EC2
>>>>> doesn't like: R or Open MPI (but maybe it's both).
>>>>>
>>>>> Gus Correa
>>>>>
>>>>> On Feb 11, 2011, at 9:54 PM, Tena Sakai wrote:
>>>>>
>>>>>> Hi Gus,
>>>>>>
>>>>>> Thank you for your tips.
>>>>>>
>>>>>> I didn't find any smoking gun or anything comes close.
>>>>>> Here's the upshot:
>>>>>>
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a
>>>>>> core file size (blocks, -c) 0
>>>>>> data seg size (kbytes, -d) unlimited
>>>>>> scheduling priority (-e) 0
>>>>>> file size (blocks, -f) unlimited
>>>>>> pending signals (-i) 61504
>>>>>> max locked memory (kbytes, -l) 32
>>>>>> max memory size (kbytes, -m) unlimited
>>>>>> open files (-n) 1024
>>>>>> pipe size (512 bytes, -p) 8
>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>> real-time priority (-r) 0
>>>>>> stack size (kbytes, -s) 8192
>>>>>> cpu time (seconds, -t) unlimited
>>>>>> max user processes (-u) 61504
>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>> file locks (-x) unlimited
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>>>>> bash-3.2#
>>>>>> bash-3.2# ulimit -a
>>>>>> core file size (blocks, -c) 0
>>>>>> data seg size (kbytes, -d) unlimited
>>>>>> scheduling priority (-e) 0
>>>>>> file size (blocks, -f) unlimited
>>>>>> pending signals (-i) 61504
>>>>>> max locked memory (kbytes, -l) 32
>>>>>> max memory size (kbytes, -m) unlimited
>>>>>> open files (-n) 1024
>>>>>> pipe size (512 bytes, -p) 8
>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>> real-time priority (-r) 0
>>>>>> stack size (kbytes, -s) 8192
>>>>>> cpu time (seconds, -t) unlimited
>>>>>> max user processes (-u) unlimited
>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>> file locks (-x) unlimited
>>>>>> bash-3.2#
>>>>>> bash-3.2#
>>>>>> bash-3.2# ulimit -a > root_ulimit-a
>>>>>> bash-3.2# exit
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>>>>>> 14c14
>>>>>> < max user processes (-u) unlimited
>>>>>> ---
>>>>>>> max user processes (-u) 61504
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>>>> 480 0 762674
>>>>>> 762674
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo su
>>>>>> bash-3.2#
>>>>>> bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>>>> 512 0 762674
>>>>>> 762674
>>>>>> bash-3.2# exit
>>>>>> exit
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>>>>>> -bash: sysctl: command not found
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ /sbin/!!
>>>>>> /sbin/sysctl -a |grep fs.file-max
>>>>>> error: permission denied on key 'kernel.cad_pid'
>>>>>> error: permission denied on key 'kernel.cap-bound'
>>>>>> fs.file-max = 762674
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
>>>>>> fs.file-max = 762674
>>>>>> [tsakai_at_ip-10-114-239-188 ~]$
>>>>>>
>>>>>> I see a bit of difference between root and tsakai, but I cannot
>>>>>> believe such a small difference results in the kind of catastrophic
>>>>>> failure I have reported. Would you agree with me?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tena
>>>>>>
>>>>>> On 2/11/11 6:06 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>>>>
>>>>>>> Hi Tena
>>>>>>>
>>>>>>> Please read one answer inline.
>>>>>>>
>>>>>>> Tena Sakai wrote:
>>>>>>>> Hi Jeff,
>>>>>>>> Hi Gus,
>>>>>>>>
>>>>>>>> Thanks for your replies.
>>>>>>>>
>>>>>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>>>>>>>> as identical to that of root. In that setting I reproduced the
>>>>>>>> same result as before: root can run mpirun correctly and tsakai
>>>>>>>> cannot.
>>>>>>>>
>>>>>>>> I have also checked out permission on /tmp directory. tsakai has
>>>>>>>> no problem creating files under /tmp.
>>>>>>>>
>>>>>>>> I am trying to come up with a strategy to show that each and every
>>>>>>>> program in the PATH has "world" executable permission. It is a
>>>>>>>> stone to turn over, but I am not holding my breath.
>>>>>>>>
>>>>>>>>> ... you are running out of file descriptors. Are file descriptors
>>>>>>>>> limited on a per-process basis, perchance?
>>>>>>>> I have never heard there is such restriction on Amazon EC2. There
>>>>>>>> are folks who keep running instances for a long, long time. Whereas
>>>>>>>> in my case, I launch 2 instances, check things out, and then turn
>>>>>>>> the instances off. (Given that the state of California has huge
>>>>>>>> debts, our funding is very tight.) So, I really doubt that's the
>>>>>>>> case. I have run mpirun unsuccessfully as user tsakai and immediately
>>>>>>>> after successfully as root. Still, I would be happy if you can tell
>>>>>>>> me a way to tell the number of file descriptors used or remaining.
>>>>>>>>
>>>>>>>> Your mention of file descriptors made me think of something under
>>>>>>>> /dev. But I don't know exactly what I am fishing for. Do you have
>>>>>>>> some suggestions?
>>>>>>>>
>>>>>>> 1) If the environment has anything to do with Linux,
>>>>>>> check:
>>>>>>>
>>>>>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>>>>>>>
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> sysctl -a |grep fs.file-max
>>>>>>>
>>>>>>> This max can be set (fs.file-max=whatever_is_reasonable)
>>>>>>> in /etc/sysctl.conf
>>>>>>>
>>>>>>> See 'man sysctl' and 'man sysctl.conf'
>>>>>>>
>>>>>>> 2) Another possible source of limits.
>>>>>>>
>>>>>>> Check "ulimit -a" (bash) or "limit" (tcsh).
>>>>>>>
>>>>>>> If you need to change look at:
>>>>>>>
>>>>>>> /etc/security/limits.conf
>>>>>>>
>>>>>>> (See also 'man limits.conf')
>>>>>>>
>>>>>>> **
>>>>>>>
>>>>>>> Since "root can but Tena cannot",
>>>>>>> I would check 2) first,
>>>>>>> as they are the 'per user/per group' limits,
>>>>>>> whereas 1) is kernel/system-wise.
>>>>>>>
>>>>>>> I hope this helps,
>>>>>>> Gus Correa
>>>>>>>
>>>>>>> PS - I know you are a wise and careful programmer,
>>>>>>> but here we had cases of programs that would
>>>>>>> fail because of too many files that were open and never closed,
>>>>>>> eventually exceeding the max available/permissible.
>>>>>>> So, it does happen.
>>>>>>>
>>>>>>>> I wish I could reproduce this (weird) behavior on a different
>>>>>>>> set of machines. I certainly cannot in my local environment. Sigh!
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tena
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> It is concerning if the pipe system call fails - I can't think of
>>>>>>>>> why that would happen. That's not usually a permissions issue but
>>>>>>>>> rather a deeper indication that something is either seriously wrong
>>>>>>>>> on your system or you are running out of file descriptors. Are file
>>>>>>>>> descriptors limited on a per-process basis, perchance?
>>>>>>>>>
>>>>>>>>> Sent from my PDA. No type good.
>>>>>>>>>
>>>>>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <gus_at_[hidden]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Tena
>>>>>>>>>>
>>>>>>>>>> Since root can but you can't,
>>>>>>>>>> is it a directory permission problem, perhaps?
>>>>>>>>>> Check the execution directory permission (on both machines,
>>>>>>>>>> if this is not NFS mounted dir).
>>>>>>>>>> I am not sure, but IIRR OpenMPI also uses /tmp for
>>>>>>>>>> under-the-hood stuff, worth checking permissions there also.
>>>>>>>>>> Just a naive guess.
>>>>>>>>>>
>>>>>>>>>> Congrats for all the progress with the cloudy MPI!
>>>>>>>>>>
>>>>>>>>>> Gus Correa
>>>>>>>>>>
>>>>>>>>>> Tena Sakai wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> I have made a bit more progress. I think I can say the ssh
>>>>>>>>>>> authentication problem is behind me now. I am still having a
>>>>>>>>>>> problem running mpirun, but the latest discovery, which I can
>>>>>>>>>>> reproduce, is that
>>>>>>>>>>> I can run mpirun as root. Here's the session log:
>>>>>>>>>>> [tsakai_at_vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>>>>>>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll
>>>>>>>>>>> total 8
>>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ll .ssh
>>>>>>>>>>> total 16
>>>>>>>>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys
>>>>>>>>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config
>>>>>>>>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>>>>>>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>>>>>>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # I am on machine B
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ hostname
>>>>>>>>>>> ip-10-100-243-195
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ ll
>>>>>>>>>>> total 8
>>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ cat app.ac
>>>>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ # go back to machine A
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-100-243-195 ~]$ exit
>>>>>>>>>>> logout
>>>>>>>>>>> Connection to ip-10-100-243-195.ec2.internal closed.
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ hostname
>>>>>>>>>>> ip-10-195-198-31
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # Execute mpirun
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun was unable to launch the specified application as it
>>>>>>>>>>> encountered an error:
>>>>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>>>>> Node: ip-10-195-198-31
>>>>>>>>>>> while attempting to start process rank 0.
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it as root
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ sudo su
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# pwd
>>>>>>>>>>> /home/tsakai
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# ls -l /root/.ssh/config
>>>>>>>>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# cat /root/.ssh/config
>>>>>>>>>>> Host *
>>>>>>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>> BatchMode yes
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# pwd
>>>>>>>>>>> /home/tsakai
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# ls -l
>>>>>>>>>>> total 8
>>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# # now is the time for mpirun
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# mpirun --app ./app.ac
>>>>>>>>>>> 13 ip-10-100-243-195
>>>>>>>>>>> 21 ip-10-100-243-195
>>>>>>>>>>> 5 ip-10-195-198-31
>>>>>>>>>>> 8 ip-10-195-198-31
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# # It works (being root)!
>>>>>>>>>>> bash-3.2#
>>>>>>>>>>> bash-3.2# exit
>>>>>>>>>>> exit
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ mpirun --app app.ac
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun was unable to launch the specified application as it
>>>>>>>>>>> encountered an error:
>>>>>>>>>>> Error: pipe function call failed when setting up I/O forwarding subsystem
>>>>>>>>>>> Node: ip-10-195-198-31
>>>>>>>>>>> while attempting to start process rank 0.
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ # I don't get it.
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$
>>>>>>>>>>> [tsakai_at_ip-10-195-198-31 ~]$ exit
>>>>>>>>>>> logout
>>>>>>>>>>> [tsakai_at_vixen ec2]$
>>>>>>>>>>> So, why does it say "pipe function call failed when setting up
>>>>>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ?
>>>>>>>>>>> The node it is referring to is not the remote machine. It is
>>>>>>>>>>> what I call machine A. I first thought maybe this is a problem
>>>>>>>>>>> with the PATH variable. But I don't think so. I compared root's
>>>>>>>>>>> PATH to that of tsakai's and made them identical and retried.
>>>>>>>>>>> I got the same behavior.
>>>>>>>>>>> If you could enlighten me why this is happening, I would really
>>>>>>>>>>> appreciate it.
>>>>>>>>>>> Thank you.
>>>>>>>>>>> Tena
>>>>>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>>>> Hi jeff,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the firewall tip. I tried it while allowing all tcp
>>>>>>>>>>>> traffic and got an interesting and perplexing result. Here's
>>>>>>>>>>>> what's interesting (BTW, I got rid of "LogLevel DEBUG3" from
>>>>>>>>>>>> ~/.ssh/config on this run):
>>>>>>>>>>>>
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>>>>> Host key verification failed.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> A daemon (pid 2743) died unexpectedly with status 255 while
>>>>>>>>>>>> attempting to launch so we are aborting.
>>>>>>>>>>>>
>>>>>>>>>>>> There may be more information reported by the environment (see
>>>>>>>>>>>> above).
>>>>>>>>>>>>
>>>>>>>>>>>> This may be because the daemon was unable to find all the needed
>>>>>>>>>>>> shared libraries on the remote node. You may set your
>>>>>>>>>>>> LD_LIBRARY_PATH to have the location of the shared libraries on
>>>>>>>>>>>> the remote nodes and this will automatically be forwarded to the
>>>>>>>>>>>> remote nodes.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>> process that caused that situation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>>
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ env | grep LD_LIB
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to
>>>>>>>>>>>> /usr/local/lib
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ export
>>>>>>>>>>>> LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I better do this on machine B
>>>>>>>>>>>> as well
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
>>>>>>>>>>>> Warning: Identity file tsakai not accessible: No such file or
>>>>>>>>>>>> directory.
>>>>>>>>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
>>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ export
>>>>>>>>>>>> LD_LIBRARY_PATH='/usr/local/lib'
>>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ env | grep LD_LIB
>>>>>>>>>>>> LD_LIBRARY_PATH=/usr/local/lib
>>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ # OK, now go back to machine A
>>>>>>>>>>>> [tsakai_at_ip-10-195-171-159 ~]$ exit
>>>>>>>>>>>> logout
>>>>>>>>>>>> Connection to ip-10-195-171-159 closed.
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ hostname
>>>>>>>>>>>> ip-10-203-21-132
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # try mpirun again
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ mpirun --app app.ac2
>>>>>>>>>>>> Host key verification failed.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> A daemon (pid 2789) died unexpectedly with status 255 while
>>>>>>>>>>>> attempting to launch so we are aborting.
>>>>>>>>>>>>
>>>>>>>>>>>> There may be more information reported by the environment (see
>>>>>>>>>>>> above).
>>>>>>>>>>>>
>>>>>>>>>>>> This may be because the daemon was unable to find all the needed
>>>>>>>>>>>> shared libraries on the remote node. You may set your
>>>>>>>>>>>> LD_LIBRARY_PATH to have the location of the shared libraries on
>>>>>>>>>>>> the remote nodes and this will automatically be forwarded to the
>>>>>>>>>>>> remote nodes.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>> process that caused that situation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>>
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # I thought openmpi library was in
>>>>>>>>>>>> /usr/local/lib...
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
>>>>>>>>>>>> total 16604
>>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> libfuse.so.2.8.5
>>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
>>>>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> libmpi.so.0.0.2
>>>>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
>>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> libopen-rte.so.0.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> libopenmpi_malloc.so.0.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> libopenmpi_malloc.so.0.0.0
>>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> libulockmgr.so.1.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> libulockmgr.so.1.0.1
>>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> libxml2.so.2.7.2
>>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> libxml2.so.2.7.2
>>>>>>>>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$ # Now, I am really confused...
>>>>>>>>>>>> [tsakai_at_ip-10-203-21-132 ~]$
>>>>>>>>>>>>
>>>>>>>>>>>> Do you know why it's complaining about shared libraries?
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Tena
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like
>>>>>>>>>>>> you might have firewall issues.
>>>>>>>>>>>>
>>>>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back
>>>>>>>>>>>> to mpirun for various command and control reasons. If it is
>>>>>>>>>>>> blocked from doing so by a firewall, Open MPI won't run. In
>>>>>>>>>>>> general, you can either disable your firewall or you can set up
>>>>>>>>>>>> a trust relationship for TCP connections within your cluster.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Reuti,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and complete
>>>>>>>>>>>> session is captured in the attached file.
>>>>>>>>>>>>
>>>>>>>>>>>> What I did is very similar to what I have done before: verify
>>>>>>>>>>>> that ssh works and then run mpirun command. In my a bit lengthy
>>>>>>>>>>>> session log, there are two responses from "LogLevel DEBUG3." First
>>>>>>>>>>>> from an scp invocation and then from mpirun invocation. They both
>>>>>>>>>>>> say
>>>>>>>>>>>> debug1: Authentication succeeded (publickey).
>>>>>>>>>>>>
>>>>>>>>>>>> From mpirun invocation, I see a line:
>>>>>>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>>>>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca
>>>>>>>>>>>> orte_ess_num_procs
>>>>>>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
>>>>>>>>>>>> The IP address at the end of the line is indeed that of machine B.
>>>>>>>>>>>> After that it hung and I Control-C'd out of it, which
>>>>>>>>>>>> gave me more lines. But the lines after
>>>>>>>>>>>> debug1: Sending command: orted bla bla bla
>>>>>>>>>>>> don't look good to me. But, in truth, I have no idea what they
>>>>>>>>>>>> mean.
>>>>>>>>>>>>
>>>>>>>>>>>> If you could shed some light, I would appreciate it very much.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Tena
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai:
>>>>>>>>>>>>
>>>>>>>>>>>> your local machine is Linux like, but the execution hosts
>>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>> No, my environment is entirely Linux. The path to my home
>>>>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>>>>>>>>>>> even though it is an NFS mount from vixen (which is known to
>>>>>>>>>>>> itself as /home/tsakai). For historical reasons, I have
>>>>>>>>>>>> chosen to make a symbolic link named /Users to vixen's /home,
>>>>>>>>>>>> so that I can use a consistent path on both vixen and blitzen.
>>>>>>>>>>>> Okay. Sometimes the protection of the home directory must be
>>>>>>>>>>>> adjusted too, but as you can do it from the command line this
>>>>>>>>>>>> shouldn't be an issue.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)?
>>>>>>>>>>>> It would also be an option to use hostbased authentication,
>>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I
>>>>>>>>>>>> ssh from my local machine (vixen) I use its public interface,
>>>>>>>>>>>> but to address one Amazon cluster node from the other I
>>>>>>>>>>>> use the nodes' private DNS names: domU-12-31-39-07-35-21 and
>>>>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private DNS names
>>>>>>>>>>>> change from one launch to another. I am using passphraseless
>>>>>>>>>>>> ssh keys for authentication in all cases, i.e., from vixen to
>>>>>>>>>>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>>>>>>>>>>> Amazon node B back to A. (Please see my initial post. There
>>>>>>>>>>>> is a session dialogue for this.) They all work without an
>>>>>>>>>>>> authentication dialogue, except a brief initial one:
>>>>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>>>>>>>>>> can't be established.
>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)?
>>>>>>>>>>>> to which I say "yes."
>>>>>>>>>>>> But I am unclear about what you mean by "hostbased authentication".
>>>>>>>>>>>> Doesn't that mean with password? If so, it is not an option.
>>>>>>>>>>>> No. It's convenient inside a private cluster as it won't fill
>>>>>>>>>>>> each user's known_hosts file and you don't need to create any
>>>>>>>>>>>> ssh-keys. But when the hostname changes every time, it might
>>>>>>>>>>>> also create new hostkeys. It uses hostkeys (private and public);
>>>>>>>>>>>> this way it works for all users. Just for reference:
>>>>>>>>>>>>
>>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>>>>>>>>>>
>>>>>>>>>>>> You could look into it later.
>>>>>>>>>>>>
>>>>>>>>>>>> ==
>>>>>>>>>>>>
>>>>>>>>>>>> - Can you try to use a command when connecting from A to B? E.g.
>>>>>>>>>>>> "ssh domU-12-31-39-06-74-E2 ls". Is this working too?
>>>>>>>>>>>>
>>>>>>>>>>>> - What about putting:
>>>>>>>>>>>>
>>>>>>>>>>>> LogLevel DEBUG3
>>>>>>>>>>>>
>>>>>>>>>>>> in your ~/.ssh/config? Maybe we can see in verbose mode what it
>>>>>>>>>>>> is trying to negotiate before it fails.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Tena
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <reuti_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Your local machine is Linux-like, but the execution hosts are
>>>>>>>>>>>> Macs? I saw the /Users/tsakai/... in your output.
>>>>>>>>>>>>
>>>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh
>>>>>>>>>>>> domU-12-31-39-07-35-21 ls
>>>>>>>>>>>>
>>>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have made a bit of progress(?)...
>>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks
>>>>>>>>>>>> like:
>>>>>>>>>>>> # machine A
>>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal
>>>>>>>>>>>> This is just an abbreviation or nickname above. To use the
>>>>>>>>>>>> specified settings, it's necessary to specify exactly this name.
>>>>>>>>>>>> When the settings are the same for all machines anyway, you can
>>>>>>>>>>>> use:
>>>>>>>>>>>>
>>>>>>>>>>>> Host *
>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>>
>>>>>>>>>>>> instead.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It
>>>>>>>>>>>> would also be an option to use hostbased authentication, which
>>>>>>>>>>>> will avoid setting any known_hosts file or passphraseless
>>>>>>>>>>>> ssh-keys for each user.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> HostName domU-12-31-39-07-35-21
>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>
>>>>>>>>>>>> # machine B
>>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal
>>>>>>>>>>>> HostName domU-12-31-39-06-74-E2
>>>>>>>>>>>> BatchMode yes
>>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>>>>>>>>>> ChallengeResponseAuthentication no
>>>>>>>>>>>> IdentitiesOnly yes
>>>>>>>>>>>>
>>>>>>>>>>>> This file exists on both machine A and machine B.
>>>>>>>>>>>>
>>>>>>>>>>>> Now when I issue the mpirun command as below:
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>>>>>>>>>>>
>>>>>>>>>>>> It hangs. I Control-C out of it and I get:
>>>>>>>>>>>> mpirun: killing job...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>> process that caused that situation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes
>>>>>>>>>>>> shown below. Additional manual cleanup may be required - please
>>>>>>>>>>>> refer to the "orte-clean" tool for assistance.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not
>>>>>>>>>>>> report back when launched
>>>>>>>>>>>>
>>>>>>>>>>>> Am I making progress?
>>>>>>>>>>>>
>>>>>>>>>>>> Does this mean I am past authentication and something else is
>>>>>>>>>>>> the problem?
>>>>>>>>>>>> Does someone have an example .ssh/config file I can look at?
>>>>>>>>>>>> There are so many keyword-argument pairs for this config file
>>>>>>>>>>>> and I would like to look at some very basic one that works.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>> tsakai_at_[hidden]
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsakai_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I have an app.ac1 file like below:
>>>>>>>>>>>> [tsakai_at_vixen local]$ cat app.ac1
>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
>>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript
>>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> The program I run is
>>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
>>>>>>>>>>>> where x is [5..8]. The machines vixen and blitzen each do two runs.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's the program fib.R:
>>>>>>>>>>>> [tsakai_at_vixen local]$ cat fib.R
>>>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively
>>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11)
>>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>>>>>>>>>>>>
>>>>>>>>>>>> fib <- function( n ) {
>>>>>>>>>>>> a <- 0
>>>>>>>>>>>> b <- 1
>>>>>>>>>>>> for ( i in 1:n ) {
>>>>>>>>>>>> t <- b
>>>>>>>>>>>> b <- a
>>>>>>>>>>>> a <- a + t
>>>>>>>>>>>> }
>>>>>>>>>>>> a
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> arg <- commandArgs( TRUE )
>>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE )
>>>>>>>>>>>> cat( fib(arg), myHost, '\n' )
>>>>>>>>>>>>
>>>>>>>>>>>> It reads an argument from the command line and produces the
>>>>>>>>>>>> Fibonacci number that corresponds to that index, followed by the
>>>>>>>>>>>> machine name. Pretty simple stuff.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's the run output:
>>>>>>>>>>>> [tsakai_at_vixen local]$ mpirun -app app.ac1
>>>>>>>>>>>> 5 vixen.egcrc.org
>>>>>>>>>>>> 8 vixen.egcrc.org
>>>>>>>>>>>> 13 blitzen.egcrc.org
>>>>>>>>>>>> 21 blitzen.egcrc.org
>>>>>>>>>>>>
>>>>>>>>>>>> Which is exactly what I expect. So far so good.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I want to run the same thing in the cloud. I launch two
>>>>>>>>>>>> instances of the same virtual machine, which I get to by:
>>>>>>>>>>>> [tsakai_at_vixen local]$ ssh -A -i ~/.ssh/tsakai
>>>>>>>>>>>> machine-instance-A-public-dns
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am on machine A:
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B
>>>>>>>>>>>> without password authentication,
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-0C-C8-01
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ # now show I can get to
>>>>>>>>>>>> machine A without using a password
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)'
>>>>>>>>>>>> can't be established.
>>>>>>>>>>>> RSA key fingerprint is
>>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes
>>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the
>>>>>>>>>>>> list of known hosts.
>>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ exit
>>>>>>>>>>>> logout
>>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed.
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-0C-C8-01 ~]$ exit
>>>>>>>>>>>> logout
>>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed.
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ # back at machine A
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ hostname
>>>>>>>>>>>> domU-12-31-39-00-D1-F2
>>>>>>>>>>>>
>>>>>>>>>>>> As you can see, neither machine uses a password for
>>>>>>>>>>>> authentication; each uses public/private key pairs. There is no
>>>>>>>>>>>> problem (that I can see) with ssh invocation from one machine to
>>>>>>>>>>>> the other. This is so because I have a copy of the public key
>>>>>>>>>>>> and a copy of the private key on each instance.
>>>>>>>>>>>>
>>>>>>>>>>>> The app.ac file is identical, except the node names:
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ cat app.ac1
>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5
>>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6
>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7
>>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8
>>>>>>>>>>>>
>>>>>>>>>>>> Here's what happens with mpirun:
>>>>>>>>>>>>
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1
>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password:
>>>>>>>>>>>> Permission denied, please try again.
>>>>>>>>>>>> tsakai_at_domu-12-31-39-0c-c8-01's password: mpirun: killing job...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>>>>>>>> process that caused that situation.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> mpirun: clean termination accomplished
>>>>>>>>>>>>
>>>>>>>>>>>> [tsakai_at_domU-12-31-39-00-D1-F2 ~]$
>>>>>>>>>>>>
>>>>>>>>>>>> Mpirun (or somebody else?) asks me for a password, which I don't have.
>>>>>>>>>>>> I end up typing control-C.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's my question:
>>>>>>>>>>>> How can I get past authentication by mpirun where there is no
>>>>>>>>>>>> password?
>>>>>>>>>>>>
>>>>>>>>>>>> I would appreciate your help/insight greatly.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Tena Sakai
>>>>>>>>>>>> tsakai_at_[hidden]
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users