Hi Jeff,
Thank you for your suggestions. I followed your steps verbatim.
Unfortunately, there is a bit of a problem. Here's what I did:
[tsakai_at_vixen ec2]$ ssh -i $MYKEY tsakai_at_[hidden]
The authenticity of host 'ec2-184-73-62-72.compute-1.amazonaws.com
(184.73.62.72)' can't be established.
RSA key fingerprint is cb:52:71:49:63:c2:52:58:9c:2e:04:46:f7:4e:b9:13.
Are you sure you want to continue connecting (yes/no)? yes
Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_ip-10-194-215-32 ~]$ # this is instance A
[tsakai_at_ip-10-194-215-32 ~]$ nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: ip-10-194-215-32.ec2.internal
Address: 10.194.215.32
[tsakai_at_ip-10-194-215-32 ~]$
[tsakai_at_ip-10-194-215-32 ~]$ rm -rf $HOME/.ssh
[tsakai_at_ip-10-194-215-32 ~]$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/tsakai/.ssh/id_dsa):
Created directory '/home/tsakai/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/tsakai/.ssh/id_dsa.
Your public key has been saved in /home/tsakai/.ssh/id_dsa.pub.
The key fingerprint is:
54:eb:bd:e7:f2:52:24:49:94:7b:7a:9e:e4:b7:0b:04 tsakai_at_ip-10-194-215-32
The key's randomart image is:
+--[ DSA 1024]----+
| .... |
| . .o |
| . .E o |
| . . .= o |
| S . .* |
| o.+ |
| .B.. |
| oo= .|
| +o+o|
+-----------------+
[tsakai_at_ip-10-194-215-32 ~]$
[tsakai_at_ip-10-194-215-32 ~]$ cd $HOME/.ssh
[tsakai_at_ip-10-194-215-32 .ssh]$ ll
total 8
-rw------- 1 tsakai tsakai 668 Feb 18 02:15 id_dsa
-rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:15 id_dsa.pub
[tsakai_at_ip-10-194-215-32 .ssh]$
[tsakai_at_ip-10-194-215-32 .ssh]$ cp id_dsa.pub authorized_keys
[tsakai_at_ip-10-194-215-32 .ssh]$ chmod 644 authorized_keys
[tsakai_at_ip-10-194-215-32 .ssh]$
[tsakai_at_ip-10-194-215-32 .ssh]$ ll
total 12
-rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:16 authorized_keys
-rw------- 1 tsakai tsakai 668 Feb 18 02:15 id_dsa
-rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:15 id_dsa.pub
[tsakai_at_ip-10-194-215-32 .ssh]$
Now the next step is to go to instance B via ssh. This doesn't
work for me because the id_dsa just generated on instance A has
no counterpart (id_dsa.pub) on instance B yet, so B has nothing
to authenticate it against. Here is what happens:
[tsakai_at_ip-10-194-215-32 .ssh]$ ssh ip-10-196-61-219.ec2.internal
The authenticity of host 'ip-10-196-61-219.ec2.internal (10.196.61.219)'
can't be established.
RSA key fingerprint is e5:ab:5b:d1:67:2c:ec:7e:33:3c:b8:b3:8a:73:5e:e9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ip-10-196-61-219.ec2.internal,10.196.61.219'
(RSA) to the list of known hosts.
Permission denied (publickey).
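For what it's worth, here is a minimal sketch of the step the failed login above is missing: getting a public key into authorized_keys on the receiving side. The function name install_pubkey and its two arguments are made up for illustration, and the 700/600 permissions are an assumption (stricter than the 644 used above), not something taken from the instructions.

```shell
# Hypothetical helper: append a public key to <dir>/authorized_keys
# without duplicating it, creating the directory if needed.
install_pubkey() {
    pub=$1    # path to a public key file, e.g. id_dsa.pub
    dir=$2    # target .ssh directory
    mkdir -p "$dir" || return 1
    chmod 700 "$dir" || return 1
    touch "$dir/authorized_keys" || return 1
    chmod 600 "$dir/authorized_keys" || return 1
    key=$(cat "$pub") || return 1
    # Append only when this exact key line is not already present.
    grep -qxF -e "$key" "$dir/authorized_keys" ||
        printf '%s\n' "$key" >> "$dir/authorized_keys"
}
```

On instance B this would be called as install_pubkey id_dsa.pub "$HOME/.ssh", after id_dsa.pub has been copied over by some channel that already works, such as the ssh -i $MYKEY login from the local machine.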
I got onto instance B directly from my local machine and did the same
as what I did on A:
[tsakai_at_vixen ec2]$ ssh -i $MYKEY tsakai_at_[hidden]
The authenticity of host 'ec2-67-202-49-161.compute-1.amazonaws.com
(67.202.49.161)' can't be established.
RSA key fingerprint is e5:ab:5b:d1:67:2c:ec:7e:33:3c:b8:b3:8a:73:5e:e9.
Are you sure you want to continue connecting (yes/no)? yes
Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_ip-10-196-61-219 ~]$
[tsakai_at_ip-10-196-61-219 ~]$ # this is instance B
[tsakai_at_ip-10-196-61-219 ~]$ nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: ip-10-196-61-219.ec2.internal
Address: 10.196.61.219
[tsakai_at_ip-10-196-61-219 ~]$
[tsakai_at_ip-10-196-61-219 ~]$ rm -rf $HOME/.ssh
[tsakai_at_ip-10-196-61-219 ~]$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/tsakai/.ssh/id_dsa):
Created directory '/home/tsakai/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/tsakai/.ssh/id_dsa.
Your public key has been saved in /home/tsakai/.ssh/id_dsa.pub.
The key fingerprint is:
dd:c1:73:97:50:eb:d1:ad:84:94:0f:98:51:b2:8d:4a tsakai_at_ip-10-196-61-219
The key's randomart image is:
+--[ DSA 1024]----+
| o=oo.. |
| oBo.. =|
| E o *oo++|
| . o . =oo.|
| S . . .. |
| |
| |
| |
| |
+-----------------+
[tsakai_at_ip-10-196-61-219 ~]$
Now comes another failure, this time from instance B:
[tsakai_at_ip-10-196-61-219 ~]$ scp @ip-10-194-215-32.ec2.internal:.ssh/id_rsa\* .
The authenticity of host 'ip-10-194-215-32.ec2.internal (10.194.215.32)'
can't be established.
RSA key fingerprint is cb:52:71:49:63:c2:52:58:9c:2e:04:46:f7:4e:b9:13.
Are you sure you want to continue connecting (yes/no)?
Host key verification failed.
[tsakai_at_ip-10-196-61-219 ~]$
I have seen these problems many times over the last few days and I
have worked them out. The failure occurs because, in order to
authenticate silently, ssh wants to see the identity of the destination
machine in the known_hosts file in the .ssh directory. One way to get
around this is to use the -i flag (which requires the private key) of
ssh once. If that is done from both directions, then ssh can
authenticate silently.
Essentially, I had done exactly the same thing as your instructions
indicate. Only I didn't use dsa, I used rsa. I don't think that is
a roadblock, is it?
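As a rough sketch of the known_hosts point, the following checks whether a host already has an entry and, if not, collects its key with ssh-keyscan instead of an interactive login. The function names known_host_present and learn_host_key are made up for illustration. The check assumes plain (unhashed) known_hosts entries like the ones shown in this mail. A key collected this way is unverified, so its fingerprint should still be compared with the one ssh reports.

```shell
# Hypothetical helper: succeed if <host> appears in the first field of
# a known_hosts file. Hashed entries (HashKnownHosts) are not matched.
known_host_present() {
    host=$1
    file=${2:-$HOME/.ssh/known_hosts}
    [ -f "$file" ] || return 1
    awk -v h="$host" '
        {
            n = split($1, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == h) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}

# Hypothetical helper: add the RSA host key of <host> if it is missing.
learn_host_key() {
    host=$1
    file=${2:-$HOME/.ssh/known_hosts}
    known_host_present "$host" "$file" && return 0
    ssh-keyscan -t rsa "$host" >> "$file"
}
```

For example, learn_host_key ip-10-110-10-137.ec2.internal run on instance B, and the reverse on instance A, would be intended to leave both known_hosts files populated without hiding .ssh/config first.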
[tsakai_at_vixen ec2]$ ssh -i $MYKEY tsakai_at_[hidden]
The authenticity of host 'ec2-50-17-48-206.compute-1.amazonaws.com
(50.17.48.206)' can't be established.
RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
Are you sure you want to continue connecting (yes/no)? yes
Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_ip-10-110-10-137 ~]$
[tsakai_at_ip-10-110-10-137 ~]$ nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: ip-10-110-10-137.ec2.internal
Address: 10.110.10.137
[tsakai_at_ip-10-110-10-137 ~]$
[tsakai_at_ip-10-110-10-137 ~]$ cd .ssh
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ll
total 12
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 config
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ # there is no known_hosts file, which we need.
[tsakai_at_ip-10-110-10-137 .ssh]$ # to create it, we need to hide config
[tsakai_at_ip-10-110-10-137 .ssh]$ mv config __config
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ssh -i tsakai tsakai_at_ip-10-110-10-137.ec2.internal
The authenticity of host 'ip-10-110-10-137.ec2.internal (10.110.10.137)'
can't be established.
RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ip-10-110-10-137.ec2.internal,10.110.10.137'
(RSA) to the list of known hosts.
Last login: Fri Feb 18 04:20:29 2011 from 63.193.205.1
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_ip-10-110-10-137 ~]$
[tsakai_at_ip-10-110-10-137 ~]$ cd .ssh
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ll
total 16
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 __config
-rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:22 known_hosts
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ # I ssh'ed to the same instance
[tsakai_at_ip-10-110-10-137 .ssh]$ who
tsakai pts/0 2011-02-18 04:20 (63.193.205.1)
tsakai pts/1 2011-02-18 04:22 (ip-10-110-10-137.ec2.internal)
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ exit
logout
Connection to ip-10-110-10-137.ec2.internal closed.
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ who
tsakai pts/0 2011-02-18 04:20 (63.193.205.1)
[tsakai_at_ip-10-110-10-137 .ssh]$ ll
total 16
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 __config
-rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:22 known_hosts
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ # known_hosts file got made
[tsakai_at_ip-10-110-10-137 .ssh]$ # what's in it?
[tsakai_at_ip-10-110-10-137 .ssh]$ wc known_hosts
1 3 425 known_hosts
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ cat known_hosts
ip-10-110-10-137.ec2.internal,10.110.10.137 ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAQEAyEMhrftyAg637XzteErroLE2Uf2PgrPz7S/Hs0Tyedk9ooWO
iIzlpTq3fEGXeZIZ4sMMiwuFQuF60TSkCUKSx9sZi8ce2Tvck1uTNrki/rlP11gY/aJ1oFW9Gg7A
LT2B8xPFThoSZntjMXYwRxxHwqVza0ELCxMV+kk6bdGeTPvFjl3tnyKEQJsdy8/HZy8v2jvFaWRq
Pzc6JIACEdkZ2AArN8Xh33yHFlOQ6XGwf86ZIqwWrbBH4Cvo6058rs9VDjzdBKcdM1D7K5ea5lF1
QGGEzfsUl7dVq6Z1UWnZoI9bqc1Mw+tpW08T2VCm0Dhz7V/UUHRtVGljQmaucpx9aw==
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ # now go to instance B
[tsakai_at_ip-10-110-10-137 .ssh]$ ssh -i tsakai tsakai_at_domU-12-31-39-16-C6-70.compute-1.internal
The authenticity of host 'domu-12-31-39-16-c6-70.compute-1.internal
(10.96.197.154)' can't be established.
RSA key fingerprint is 2e:8b:83:39:02:9f:48:d6:fd:49:2f:82:96:0b:84:35.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added
'domu-12-31-39-16-c6-70.compute-1.internal,10.96.197.154' (RSA) to the list
of known hosts.
Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ # I am on instance B
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: domU-12-31-39-16-C6-70.compute-1.internal
Address: 10.96.197.154
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ cd .ssh
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ ll
total 12
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 config
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ # the same trick
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ mv config __config
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ ssh -i tsakai tsakai_at_ip-10-110-10-137.ec2.internal
The authenticity of host 'ip-10-110-10-137.ec2.internal (10.110.10.137)'
can't be established.
RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ip-10-110-10-137.ec2.internal,10.110.10.137'
(RSA) to the list of known hosts.
Last login: Fri Feb 18 04:22:24 2011 from ip-10-110-10-137.ec2.internal
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_ip-10-110-10-137 ~]$
[tsakai_at_ip-10-110-10-137 ~]$ # I am on instance A
[tsakai_at_ip-10-110-10-137 ~]$ # go back to instance B
[tsakai_at_ip-10-110-10-137 ~]$ exit
logout
Connection to ip-10-110-10-137.ec2.internal closed.
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ ll
total 16
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 __config
-rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:27 known_hosts
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ # known_hosts got made
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ cat known_hosts
ip-10-110-10-137.ec2.internal,10.110.10.137 ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAQEAyEMhrftyAg637XzteErroLE2Uf2PgrPz7S/Hs0Tyedk9ooWO
iIzlpTq3fEGXeZIZ4sMMiwuFQuF60TSkCUKSx9sZi8ce2Tvck1uTNrki/rlP11gY/aJ1oFW9Gg7A
LT2B8xPFThoSZntjMXYwRxxHwqVza0ELCxMV+kk6bdGeTPvFjl3tnyKEQJsdy8/HZy8v2jvFaWRq
Pzc6JIACEdkZ2AArN8Xh33yHFlOQ6XGwf86ZIqwWrbBH4Cvo6058rs9VDjzdBKcdM1D7K5ea5lF1
QGGEzfsUl7dVq6Z1UWnZoI9bqc1Mw+tpW08T2VCm0Dhz7V/UUHRtVGljQmaucpx9aw==
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ mv __config config
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ ll
total 16
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 config
-rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:27 known_hosts
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ # go back to instance A
[tsakai_at_domU-12-31-39-16-C6-70 .ssh]$ exit
logout
Connection to domU-12-31-39-16-C6-70.compute-1.internal closed.
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ll
total 16
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 __config
-rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ mv __config config
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ll
total 16
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 config
-rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ # now show I can ssh without -i flag silently
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ssh domU-12-31-39-16-C6-70.compute-1.internal
Last login: Fri Feb 18 04:25:56 2011 from ip-10-110-10-137.ec2.internal
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ # and to instance A
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ ssh ip-10-110-10-137.ec2.internal
Last login: Fri Feb 18 04:27:36 2011 from
domu-12-31-39-16-c6-70.compute-1.internal
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai_at_ip-10-110-10-137 ~]$
[tsakai_at_ip-10-110-10-137 ~]$ # OK
[tsakai_at_ip-10-110-10-137 ~]$ # go back to instance B
[tsakai_at_ip-10-110-10-137 ~]$ exit
logout
Connection to ip-10-110-10-137.ec2.internal closed.
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ env | grep -i path
LD_LIBRARY_PATH=:/usr/local/lib
PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin:/home/tsakai/bin
AWS_PATH=/opt/aws
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ # check firewall
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ sudo service iptables status
iptables: Firewall is not running.
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ ll -t /usr/local/lib | head
total 4100
-rw-r--r-- 1 root root 385864 Feb 16 01:33 libvt.a
-rw-r--r-- 1 root root 154950 Feb 16 01:33 libvt.fmpi.a
-rw-r--r-- 1 root root 567848 Feb 16 01:33 libvt.mpi.a
-rw-r--r-- 1 root root 462838 Feb 16 01:33 libvt.omp.a
-rw-r--r-- 1 root root 643482 Feb 16 01:33 libvt.ompi.a
-rw-r--r-- 1 root root 231278 Feb 16 01:33 libotf.a
-rwxr-xr-x 1 root root 891 Feb 16 01:33 libotf.la
drwxr-xr-x 2 root root 4096 Feb 16 01:33 openmpi
-rwxr-xr-x 1 root root 991 Feb 16 01:33 libmca_common_sm.la
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ sudo find / -name mpirun
/usr/local/bin/mpirun
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ cat .ssh/config
Host *
IdentityFile /home/tsakai/.ssh/tsakai
IdentitiesOnly yes
BatchMode yes
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ # try mpirun without the other machine
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ mpirun --host `hostname` -np 2 hostname
domU-12-31-39-16-C6-70
domU-12-31-39-16-C6-70
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ mpirun --host domU-12-31-39-16-C6-70.compute-1.internal -np 2 hostname
domU-12-31-39-16-C6-70
domU-12-31-39-16-C6-70
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ # now add an extra host
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ mpirun --host \
> domU-12-31-39-16-C6-70.compute-1.internal,ip-10-110-10-137.ec2.internal \
> -np 2 \
> hostname
# it is hanging
# let me issue control-c
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
ip-10-110-10-137.ec2.internal - daemon did not report back when
launched
[tsakai_at_domU-12-31-39-16-C6-70 ~]$
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ # go back to machine A
[tsakai_at_domU-12-31-39-16-C6-70 ~]$ exit
logout
Connection to domU-12-31-39-16-C6-70.compute-1.internal closed.
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ll
total 16
-rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
-rw------- 1 tsakai tsakai 81 Feb 16 04:10 config
-rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
-rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ sudo service iptables status
iptables: Firewall is not running.
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ sudo find / -name mpirun
/usr/local/bin/mpirun
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ env | grep -i path
LD_LIBRARY_PATH=:/usr/local/lib
PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin:/home/tsakai/bin
AWS_PATH=/opt/aws
[tsakai_at_ip-10-110-10-137 .ssh]$ cat config
Host *
IdentityFile /home/tsakai/.ssh/tsakai
IdentitiesOnly yes
BatchMode yes
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ mpirun --host `hostname` -np 2 hostname
ip-10-110-10-137
ip-10-110-10-137
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ mpirun --host ip-10-110-10-137.ec2.internal -np 2 hostname
ip-10-110-10-137
ip-10-110-10-137
[tsakai_at_ip-10-110-10-137 .ssh]$ # add the other instance
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ mpirun --host \
> ip-10-110-10-137.ec2.internal,domU-12-31-39-16-C6-70.compute-1.internal \
> -np 2 \
> hostname
# again hanging; issuing control-c
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
domU-12-31-39-16-C6-70.compute-1.internal - daemon did not report
back when launched
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ # try with IP
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: ip-10-110-10-137.ec2.internal
Address: 10.110.10.137
[tsakai_at_ip-10-110-10-137 .ssh]$ mpirun --host 10.110.10.137 -np 2 hostname
ip-10-110-10-137
ip-10-110-10-137
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ ssh domU-12-31-39-16-C6-70.compute-1.internal 'nslookup domU-12-31-39-16-C6-70'
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: domU-12-31-39-16-C6-70.compute-1.internal
Address: 10.96.197.154
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ mpirun --host \
> 10.110.10.137,10.96.197.154 \
> -np 2 hostname
# hanging also, get out by control-c
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
10.96.197.154 - daemon did not report back when launched
[tsakai_at_ip-10-110-10-137 .ssh]$
[tsakai_at_ip-10-110-10-137 .ssh]$ # I can't figure out what more to do....
[tsakai_at_ip-10-110-10-137 .ssh]$ exit
logout
[tsakai_at_vixen ec2]$
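The transcript above checks PATH and LD_LIBRARY_PATH only from interactive logins, whereas the quoted suggestions below ask for the non-interactive form (ssh othernode env | grep -i path). A minimal sketch of that check follows. The function names and the SSH override variable are made up for illustration, and /usr/local/bin and /usr/local/lib are assumed to be the Open MPI locations only because that is where mpirun and the libraries were found above.

```shell
# Hypothetical helper: print the first non-empty element of a
# colon-separated list (LD_LIBRARY_PATH above starts with an empty one).
first_dir() {
    printf '%s\n' "$1" | tr ':' '\n' | sed -n '/./{p;q;}'
}

# Hypothetical helper: print the value of variable <name> as seen by a
# non-interactive ssh session on <host>. SSH may be overridden for
# testing; it defaults to the real ssh client.
remote_var() {
    ${SSH:-ssh} "$1" env | sed -n "s/^$2=//p"
}

# Hypothetical helper: succeed if <dir> leads variable <name> on <host>.
check_prefix() {
    [ "$(first_dir "$(remote_var "$1" "$2")")" = "$3" ]
}
```

From instance A one would then try, for example, check_prefix domU-12-31-39-16-C6-70.compute-1.internal PATH /usr/local/bin and the same with LD_LIBRARY_PATH and /usr/local/lib. A failure would suggest that the startup files are not applied to non-interactive sessions.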
Do you see anything incorrect in what I am doing?
Thank you.
Regards,
Tena
On 2/17/11 6:52 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
> On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:
>
>> For now, may I point out something I noticed out of the
>> DEBUG3 Output last night?
>>
>> I found this line:
>>
>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>
> What this means is that ssh sent the "orted ..." command to the remote side.
>
> As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" --
> it's a helper thingy that mpirun launches on the remote nodes before launching
> your actual application. All those parameters (from --daemonize through
> ...:56064") are options for orted.
>
> All of that gorp is considered internal to Open MPI -- most people never see
> that stuff.
>
>> Followed by:
>>
>>> debug2: channel 0: request exec confirm 1
>>> debug2: fd 3 setting TCP_NODELAY
>>> debug2: callback done
>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>> debug3: Wrote 272 bytes for a total of 1893
>>> debug2: channel 0: rcvd adjust 2097152
>>> debug2: channel_input_status_confirm: type 99 id 0
>
> This is just more status information about the ssh connection; it doesn't
> really have any direct relation to Open MPI.
>
> I don't know offhand if ssh displays the ack that a command successfully ran.
> If you're not convinced that it did, then login to the other node while the
> command is hung and run a ps to see if the orted is actually running or not.
> I *suspect* that it is running, but that it's just hung for some reason.
>
> -----
>
> Here are some suggestions to try for debugging:
>
> On your new Linux AMI instances (some of this may be redundant with what you
> did already):
>
> - ensure that firewalling is disabled on all instances
>
> - ensure that your .bashrc (or whatever startup file is relevant to your
> shell) is set to prefix PATH and LD_LIBRARY_PATH with your Open MPI
> installation. Be sure to *PREFIX* these variables to guarantee that you don't
> get interference from already-installed versions of Open MPI (e.g., if Open
> MPI is installed by default on your AMI and you weren't aware of it)
>
> - set up a simple, per-user SSH key, perhaps something like this:
>
> A$ rm -rf $HOME/.ssh
> (remove what you had before; let's just start over)
>
> A$ ssh-keygen -t dsa
> (hit enter to accept all defaults and set no passphrase)
>
> A$ cd $HOME/.ssh
> A$ cp id_dsa.pub authorized_keys
> A$ chmod 644 authorized_keys
> A$ ssh othernode
> (login to node B)
>
> B$ ssh-keygen -t dsa
> (hit enter to accept all defaults and set no passphrase; just to create
> $HOME/.ssh with the right permissions, etc.)
>
> B$ scp @firstnode:.ssh/id_dsa\* .
> (enter your password on A -- we're overwriting all the files here)
>
> B$ cp id_dsa.pub authorized_keys
> B$ chmod 644 authorized_keys
>
> Now you should be able to ssh from one node to the other without passwords:
>
> A$ ssh othernode hostname
> B
> A$
>
> and
>
> B$ ssh firstnode hostname
> A
> B$
>
> Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to
> ensure that non-interactive logins work properly. That's what Open MPI will
> use under the covers.
>
> - Now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh
> sessions (i.e., some .bashrc's will exit "early" if they detect that it is a
> non-interactive session). For example:
>
> A$ ssh othernode env | grep -i path
>
> Ensure that the output shows the path and ld_library_path locations for Open
> MPI at the beginning of those variables. To go for the gold, you can try
> this, too:
>
> A$ ssh othernode which ompi_info
> (if all paths are set right, this should show the ompi_info of your 1.4.3
> install)
> A$ ssh othernode ompi_info
> (should show all the info about your 1.4.3 install)
>
> - If all the above works, then test with a simple, non-MPI application across
> both nodes:
>
> A$ mpirun --host firstnode,othernode -np 2 hostname
> A
> B
> A$
>
> - When that works, you should be able to test with a simple MPI application
> (e.g., the examples/ring_c.c file in the Open MPI distribution):
>
> A$ cd /path/to/open/mpi/source
> A$ cd examples
> A$ make
> ...
> A$ scp ring_c @othernode:/path/to/open/mpi/source/examples
> ...
> A$ mpirun --host firstnode,othernode -np 4 ring_c
>
> Make sense?