Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] This must be ssh problem, but I can't figure out what it is...
From: Tena Sakai (tsakai_at_[hidden])
Date: 2011-02-18 04:15:12


Hi Gus,

I am starting to see the light at the other end of the tunnel.
As I wrote in reply to Jeff, it was not an ssh problem. It was
a setting of the user-configurable firewall that Amazon calls a
security group. I need to expand my small tests to a wider set,
but I think I can do that. I will keep you posted in the coming
days/weeks.
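
For the record, in case anyone else hits this: the instances must be
able to reach each other on the ports Open MPI picks at run time, not
just on port 22. I believe the TCP ports can be pinned to a known
range with stock MCA parameters (the names below are from the Open
MPI FAQ on firewalled clusters; the actual port numbers are arbitrary
choices of mine), so the security group only needs ssh plus that
range open between the instances:

  # keep orted's out-of-band traffic and the TCP BTL in fixed ranges
  mpirun --mca oob_tcp_port_min_v4 10000 --mca oob_tcp_port_range_v4 100 \
         --mca btl_tcp_port_min_v4 10100 --mca btl_tcp_port_range_v4 100 \
         -app app.ac

The blunt alternative is simply to authorize all TCP traffic between
members of the same security group.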

Many thanks for the dialog. I really appreciate your help and
explanations.

Thank you!

Regards,

Tena

On 2/16/11 4:31 PM, "Gus Correa" <gus_at_[hidden]> wrote:

> Hi Tena
>
> Again, I think your EC2 session log with ssh debug3 level (below)
> should be looked at by somebody more knowledgeable in OpenMPI
> and in ssh than me.
> There must be some clue to what is going on there.
>
> Ssh experts, Jeff, Ralph, please help!
>
> Anyway ...
> AFAIK, 'orted' in the first line you selected/highlighted below
> is the Open MPI Run-Time Environment daemon ( ... the Open MPI pros
> are authorized to send me to the galleys if it is not ...).
> So, orted is trying to do its thing, to create the conditions for your
> job to run across the two EC2 'instances'. (Gone are the naive
> days when these things were computers, each one on its box ...)
> This master-of-ceremonies work of orted is done via TCP, and I guess
> 10.96.118.236 is the IP (of computer B?),
> and 56064 is probably the port
> where orted may be trying to open a socket.
> The bunch of -mca parameters are just what they are: MCA parameters
> (MCA = Modular Component Architecture of Open MPI, and here I am
> risking being shanghaied or ridiculed again ...).
> (You can learn more about the mca parameters with 'ompi_info -help'.)
> That is how in my ignorance I parse that line.
>
> So, from the computer/instance-A side orted gives the first kick,
> but somehow the ball never comes back from computer/instance-B.
> It's ping- without -pong.
> The same frustrating feeling I had when I was a kid and kicked the
> soccer ball on the neighbor's side and would never see it again.
>
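> If you want to poke at that directly: while mpirun is hanging on
> instance A, check from instance B whether the callback address
> answers at all. A rough test (assuming 'nc' is installed; read the
> IP/port off the orted line of the *current* run, since the port
> changes every time):
>
>   nc -vz 10.96.118.236 56064    # or: telnet 10.96.118.236 56064
>
> And 'ompi_info --param oob tcp' (likewise 'btl tcp') lists the MCA
> knobs behind that callback machinery.
>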
> Cheers,
> Gus
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you for your reply and suggestions.
>>
>> I will follow up on these in a bit and will give you an
>> update. Looking at what vixen and/or dasher generates
>> from DEBUG3 would be interesting.
>>
>> For now, may I point out something I noticed in the
>> DEBUG3 output last night?
>>
>> I found this line:
>>
>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>
>> Followed by:
>>
>>> debug2: channel 0: request exec confirm 1
>>> debug2: fd 3 setting TCP_NODELAY
>>> debug2: callback done
>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>> debug3: Wrote 272 bytes for a total of 1893
>>> debug2: channel 0: rcvd adjust 2097152
>>> debug2: channel_input_status_confirm: type 99 id 0
>>
>> It appears, to my untrained eye/mind, that a directive from instance A
>> to B was issued - and then what happened? I don't see that it was
>> honored by instance B.
>>
>> Can you please comment on this?
>>
>> Thank you.
>>
>> Regards,
>>
>> Tena
>>
>> On 2/16/11 1:34 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>
>>> Hi Tena
>>>
>>> I hope somebody more knowledgeable in ssh
>>> takes a look at the debug3 session log that you included.
>>>
>>> I can't see if/where/why ssh is failing for you in EC2.
>>>
>>> See other answers inline, please.
>>>
>>> Tena Sakai wrote:
>>>> Hi Gus,
>>>>
>>>> Thank you again for your reply.
>>>>
>>>>> A slight difference is that on vixen and dasher you ran the
>>>>> MPI hostname tests as a regular user, not as root, right?
>>>>> Not sure if this will make much of a difference,
>>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>>> In general most people avoid running user applications (MPI programs
>>>>> included) as root.
>>>>> Mostly for safety, but I wonder if there are any
>>>>> implications in the 'rootly powers'
>>>>> regarding the under-the-hood processes that OpenMPI
>>>>> launches along with the actual user programs.
>>>> Yes, between vixen and dasher I was doing the test as user tsakai,
>>>> not as root. But the reason I wanted to do this test as root is
>>>> to show that it fails as a regular user (generating a 'pipe system
>>>> call failed' error), whereas as root it would succeed, as it did
>>>> on Friday.
>>> Sorry again.
>>> I even wrote "root can and Tena cannot", then I forgot.
>>> Too many tasks at the same time, too much context-switching ...
>>>
>>>> The AMI has not changed; the last change to it was last Tuesday,
>>>> so I don't understand this inconsistent behavior. I have lots of
>>>> notes from previous sessions, and I consulted several successful
>>>> session logs trying to replicate what I saw on Friday, but with
>>>> no success.
>>>>
>>>> Having spent days and not gotten anywhere, I decided to take a
>>>> different approach. I instantiated a Linux AMI built by Amazon,
>>>> which feels CentOS/Fedora-based. I downloaded gcc and c++, plus
>>>> Open MPI 1.4.3. After I got Open MPI running, I created an account
>>>> for user tsakai, uploaded my public key, logged back in as user
>>>> tsakai, and ran the same test. Surprisingly (or not?), it generated
>>>> the same result: I cannot run the same mpirun command when a remote
>>>> instance is involved, but by itself on one instance mpirun runs
>>>> fine. So, I am feeling that this has to be an ssh authentication
>>>> problem. I looked at the man pages for ssh and ssh_config and
>>>> cannot figure out what I am doing wrong. I put in a "LogLevel
>>>> DEBUG3" line and it generated lots of lines, among which I found
>>>> this line:
>>>> debug1: Authentication succeeded (publickey).
>>>> Then I see a bunch of lines that look like:
>>>> debug3: Ignored env XXXXXXX
>>>> and mpirun hangs. Here is the session log:
>>>>
>>> Ssh on our clusters uses host-based authentication.
>>> I think Reuti sent you his page about it:
>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>
>>> However, I believe OpenMPI shouldn't care which ssh authentication
>>> mechanism is used, as long as it works passwordless.
>>>
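>>> For reference, the knobs that recipe turns on are roughly these
>>> (a sketch from memory -- see Reuti's page for the full setup,
>>> including shosts.equiv and the system-wide known-hosts file):
>>>
>>>   # in /etc/ssh/sshd_config on every node:
>>>   HostbasedAuthentication yes
>>>
>>>   # in /etc/ssh/ssh_config on every node:
>>>   HostbasedAuthentication yes
>>>   EnableSSHKeysign yes
>>>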
>>> As for ssh configuration, ours is pretty standard:
>>>
>>> 1) We don't have 'IdentitiesOnly yes' (default is 'no'),
>>> but use standard identity file names id_rsa, etc.
>>> I think you are just telling ssh to use the specific identity
>>> file you named.
>>> I don't know if this may cause the problem, but who knows?
>>>
>>> 2) We don't have 'BatchMode yes' set.
>>>
>>> 3) We have the GSS authentication set
>>>
>>> GSSAPIAuthentication yes
>>>
>>> 4) The locale environment variables are also passed
>>> (may not be crucial):
>>>
>>> SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
>>> SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
>>> SendEnv LC_IDENTIFICATION LC_ALL
>>>
>>> 5) And X forwarding (you're not doing any X stuff, I suppose):
>>>
>>> ForwardX11Trusted yes
>>>
>>> 6) However, you may want to check what is in your
>>> /etc/ssh/ssh_config and /etc/ssh/sshd_config,
>>> because some options may already be set there.
>>>
>>> 7) Take a look at 'man ssh[d]' and 'man ssh[d]_config' too.
>>>
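>>> In other words, to rule items 1) and 2) out on your side, you could
>>> temporarily slim your ~/.ssh/config down to just the identity file
>>> (keep a copy of the original first), e.g.:
>>>
>>>   Host *
>>>     IdentityFile /home/tsakai/.ssh/tsakai
>>>     # IdentitiesOnly and BatchMode dropped for the test
>>>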
>>> ***
>>>
>>> Finally, if you are willing, it may be worth running the same
>>> experiment (with debug3) on vixen and dasher, just to compare what
>>> comes out of the verbose ssh messages with what you see in EC2.
>>> Perhaps it will help nail down the reason for the failure.
>>>
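>>> For instance (the orted arguments below are copied from your log;
>>> they are only valid while that particular mpirun is still running,
>>> so treat this as a sketch of the technique, not a recipe):
>>>
>>>   # does a plain remote command come back promptly?
>>>   ssh -vvv domU-12-31-39-16-4E-4C.compute-1.internal /bin/true
>>>
>>>   # replay mpirun's launch line without --daemonize, so any error
>>>   # from orted lands on your terminal instead of being lost:
>>>   ssh domU-12-31-39-16-4E-4C.compute-1.internal orted -mca ess env \
>>>       -mca orte_ess_jobid 125566976 -mca orte_ess_vpid 1 \
>>>       -mca orte_ess_num_procs 2 \
>>>       --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>>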
>>> Gus Correa
>>>
>>>
>>>
>>>> [tsakai_at_vixen ec2]$
>>>> [tsakai_at_vixen ec2]$ ssh -i $MYKEY
>>>> tsakai_at_[hidden]
>>>> Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
>>>>
>>>> __| __|_ ) Amazon Linux AMI
>>>> _| ( / Beta
>>>> ___|\___|___|
>>>>
>>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>>> :-)
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # show firewall is off
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ service iptables status
>>>> -bash: service: command not found
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ sudo service iptables status
>>>> iptables: Firewall is not running.
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no
>>>> password authentication
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ ssh
>>>> domU-12-31-39-16-4E-4C.compute-1.internal
>>>> Last login: Wed Feb 16 06:53:14 2011 from
>>>> domu-12-31-39-16-75-1e.compute-1.internal
>>>>
>>>> __| __|_ ) Amazon Linux AMI
>>>> _| ( / Beta
>>>> ___|\___|___|
>>>>
>>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>>> :-)
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # also back to inst A
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ ssh
>>>> domU-12-31-39-16-75-1E.compute-1.internal
>>>> Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
>>>>
>>>> __| __|_ ) Amazon Linux AMI
>>>> _| ( / Beta
>>>> ___|\___|___|
>>>>
>>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>>> :-)
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # OK
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # back to inst B
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ exit
>>>> logout
>>>> Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
>>>> LD_LIBRARY_PATH=:/usr/local/lib
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
>>>> iptables: Firewall is not running.
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # go back to inst A
>>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ exit
>>>> logout
>>>> Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
>>>> LD_LIBRARY_PATH=:/usr/local/lib
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cat app.ac
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine);
>>>> bottom 2 are remote inst (inst B)
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>>> ^Cmpirun: killing job...
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>>
>>>> --------------------------------------------------------------------------
>>>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>>> back when launched
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when
>>>> launched ***
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cat app.ac2
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
>>>> domU-12-31-39-16-75-1E
>>>> domU-12-31-39-16-75-1E
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # that's no problem
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cd .ssh
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>>>> Host *
>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ ll config
>>>> -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>>>> Host *
>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat - >> config
>>>> LogLevel DEBUG3
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>>>> Host *
>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> LogLevel DEBUG3
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ ll config
>>>> -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cd ..
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>>> debug2: ssh_connect: needpriv 0
>>>> debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal
>>>> [10.96.77.182] port 22.
>>>> debug1: Connection established.
>>>> debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
>>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>> debug3: key_read: missing keytype
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug2: key_type_from_name: unknown key type '-----END'
>>>> debug3: key_read: missing keytype
>>>> debug1: identity file /home/tsakai/.ssh/tsakai type -1
>>>> debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
>>>> debug1: match: OpenSSH_5.3 pat OpenSSH*
>>>> debug1: Enabling compatibility mode for protocol 2.0
>>>> debug1: Local version string SSH-2.0-OpenSSH_5.3
>>>> debug2: fd 3 setting O_NONBLOCK
>>>> debug1: SSH2_MSG_KEXINIT sent
>>>> debug3: Wrote 792 bytes for a total of 813
>>>> debug1: SSH2_MSG_KEXINIT received
>>>> debug2: kex_parse_kexinit:
>>>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>>> debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.liu.se
>>>> debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.liu.se
>>>> debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden],zlib
>>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden],zlib
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit: first_kex_follows 0
>>>> debug2: kex_parse_kexinit: reserved 0
>>>> debug2: kex_parse_kexinit:
>>>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>>> debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.liu.se
>>>> debug2: kex_parse_kexinit:
>>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.liu.se
>>>> debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit:
>>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden]
>>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden]
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit: first_kex_follows 0
>>>> debug2: kex_parse_kexinit: reserved 0
>>>> debug2: mac_setup: found hmac-md5
>>>> debug1: kex: server->client aes128-ctr hmac-md5 none
>>>> debug2: mac_setup: found hmac-md5
>>>> debug1: kex: client->server aes128-ctr hmac-md5 none
>>>> debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
>>>> debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
>>>> debug3: Wrote 24 bytes for a total of 837
>>>> debug2: dh_gen_key: priv key bits set: 125/256
>>>> debug2: bits set: 489/1024
>>>> debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
>>>> debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
>>>> debug3: Wrote 144 bytes for a total of 981
>>>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>>> debug3: check_host_in_hostfile: match line 1
>>>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>>> debug3: check_host_in_hostfile: match line 1
>>>> debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and
>>>> matches the RSA host key.
>>>> debug1: Found key in /home/tsakai/.ssh/known_hosts:1
>>>> debug2: bits set: 491/1024
>>>> debug1: ssh_rsa_verify: signature correct
>>>> debug2: kex_derive_keys
>>>> debug2: set_newkeys: mode 1
>>>> debug1: SSH2_MSG_NEWKEYS sent
>>>> debug1: expecting SSH2_MSG_NEWKEYS
>>>> debug3: Wrote 16 bytes for a total of 997
>>>> debug2: set_newkeys: mode 0
>>>> debug1: SSH2_MSG_NEWKEYS received
>>>> debug1: SSH2_MSG_SERVICE_REQUEST sent
>>>> debug3: Wrote 48 bytes for a total of 1045
>>>> debug2: service_accept: ssh-userauth
>>>> debug1: SSH2_MSG_SERVICE_ACCEPT received
>>>> debug2: key: /home/tsakai/.ssh/tsakai ((nil))
>>>> debug3: Wrote 64 bytes for a total of 1109
>>>> debug1: Authentications that can continue: publickey
>>>> debug3: start over, passed a different list publickey
>>>> debug3: preferred gssapi-with-mic,publickey
>>>> debug3: authmethod_lookup publickey
>>>> debug3: remaining preferred: ,publickey
>>>> debug3: authmethod_is_enabled publickey
>>>> debug1: Next authentication method: publickey
>>>> debug1: Trying private key: /home/tsakai/.ssh/tsakai
>>>> debug1: read PEM private key done: type RSA
>>>> debug3: sign_and_send_pubkey
>>>> debug2: we sent a publickey packet, wait for reply
>>>> debug3: Wrote 384 bytes for a total of 1493
>>>> debug1: Authentication succeeded (publickey).
>>>> debug2: fd 4 setting O_NONBLOCK
>>>> debug1: channel 0: new [client-session]
>>>> debug3: ssh_session2_open: channel_new: 0
>>>> debug2: channel 0: send open
>>>> debug1: Requesting no-more-sessions_at_[hidden]
>>>> debug1: Entering interactive session.
>>>> debug3: Wrote 128 bytes for a total of 1621
>>>> debug2: callback start
>>>> debug2: client_session2_setup: id 0
>>>> debug1: Sending environment.
>>>> debug3: Ignored env HOSTNAME
>>>> debug3: Ignored env TERM
>>>> debug3: Ignored env SHELL
>>>> debug3: Ignored env HISTSIZE
>>>> debug3: Ignored env EC2_AMITOOL_HOME
>>>> debug3: Ignored env SSH_CLIENT
>>>> debug3: Ignored env SSH_TTY
>>>> debug3: Ignored env USER
>>>> debug3: Ignored env LD_LIBRARY_PATH
>>>> debug3: Ignored env LS_COLORS
>>>> debug3: Ignored env EC2_HOME
>>>> debug3: Ignored env MAIL
>>>> debug3: Ignored env PATH
>>>> debug3: Ignored env INPUTRC
>>>> debug3: Ignored env PWD
>>>> debug3: Ignored env JAVA_HOME
>>>> debug1: Sending env LANG = en_US.UTF-8
>>>> debug2: channel 0: request env confirm 0
>>>> debug3: Ignored env AWS_CLOUDWATCH_HOME
>>>> debug3: Ignored env AWS_IAM_HOME
>>>> debug3: Ignored env SHLVL
>>>> debug3: Ignored env HOME
>>>> debug3: Ignored env AWS_PATH
>>>> debug3: Ignored env AWS_AUTO_SCALING_HOME
>>>> debug3: Ignored env LOGNAME
>>>> debug3: Ignored env AWS_ELB_HOME
>>>> debug3: Ignored env SSH_CONNECTION
>>>> debug3: Ignored env LESSOPEN
>>>> debug3: Ignored env AWS_RDS_HOME
>>>> debug3: Ignored env G_BROKEN_FILENAMES
>>>> debug3: Ignored env _
>>>> debug3: Ignored env OLDPWD
>>>> debug3: Ignored env OMPI_MCA_plm
>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>>> debug2: channel 0: request exec confirm 1
>>>> debug2: fd 3 setting TCP_NODELAY
>>>> debug2: callback done
>>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>>> debug3: Wrote 272 bytes for a total of 1893
>>>> debug2: channel 0: rcvd adjust 2097152
>>>> debug2: channel_input_status_confirm: type 99 id 0
>>>> debug2: exec request accepted on channel 0
>>>> debug2: channel 0: read<=0 rfd 4 len 0
>>>> debug2: channel 0: read failed
>>>> debug2: channel 0: close_read
>>>> debug2: channel 0: input open -> drain
>>>> debug2: channel 0: ibuf empty
>>>> debug2: channel 0: send eof
>>>> debug2: channel 0: input drain -> closed
>>>> debug3: Wrote 32 bytes for a total of 1925
>>>> debug2: channel 0: rcvd eof
>>>> debug2: channel 0: output open -> drain
>>>> debug2: channel 0: obuf empty
>>>> debug2: channel 0: close_write
>>>> debug2: channel 0: output drain -> closed
>>>> debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
>>>> debug2: channel 0: rcvd close
>>>> debug3: channel 0: will not send data after close
>>>> debug2: channel 0: almost dead
>>>> debug2: channel 0: gc: notify user
>>>> debug2: channel 0: gc: user detached
>>>> debug2: channel 0: send close
>>>> debug2: channel 0: is dead
>>>> debug2: channel 0: garbage collecting
>>>> debug1: channel 0: free: client-session, nchannels 1
>>>> debug3: channel 0: status: The following connections are open:
>>>> #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
>>>>
>>>> debug3: channel 0: close_fds r -1 w -1 e 6 c -1
>>>> debug3: Wrote 32 bytes for a total of 1957
>>>> debug3: Wrote 64 bytes for a total of 2021
>>>> debug1: fd 0 clearing O_NONBLOCK
>>>> Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
>>>> Bytes per second: sent 18384.8, received 18944.3
>>>> debug1: Exit status 0
>>>> # it is hanging; I am about to issue control-C
>>>> ^Cmpirun: killing job...
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>>
>>>> --------------------------------------------------------------------------
>>>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>>> back when launched
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # daemon did not report back when
>>>> launched
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # what does that mean?
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # I give up
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ exit
>>>> logout
>>>> [tsakai_at_vixen ec2]$
>>>> [tsakai_at_vixen ec2]$
>>>>
>>>> Do you see anything strange?
>>>>
>>>> One final question: the ssh man page mentions a few environment
>>>> variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc. Do
>>>> any of these matter as far as Open MPI is concerned?
>>>>
>>>> Thank you, Gus.
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>>> On 2/15/11 5:09 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>>
>>>>> Tena Sakai wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to reproduce what I was able to show last Friday on Amazon
>>>>>> EC2 instances, but I am having a problem. What I was able to show last
>>>>>> Friday as root was with this command:
>>>>>> mpirun -app app.ac
>>>>>> with app.ac being:
>>>>>> -H dns-entry-A -np 1 (linux command)
>>>>>> -H dns-entry-A -np 1 (linux command)
>>>>>> -H dns-entry-B -np 1 (linux command)
>>>>>> -H dns-entry-B -np 1 (linux command)
>>>>>>
>>>>>> Here's the config file in root's .ssh directory:
>>>>>> Host *
>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>> IdentitiesOnly yes
>>>>>> BatchMode yes
>>>>>>
>>>>>> Yesterday and today I can't get this to work. I made the last part
>>>>>> of the app.ac file simpler (it now says /bin/hostname). Below is
>>>>>> the session:
>>>>>>
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # I am on instance A, host name for inst A is:
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>>>>>> Server: 172.16.0.23
>>>>>> Address: 172.16.0.23#53
>>>>>>
>>>>>> Non-authoritative answer:
>>>>>> Name: domU-12-31-39-09-CD-C2.compute-1.internal
>>>>>> Address: 10.210.210.48
>>>>>>
>>>>>> -bash-3.2# cd .ssh
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cat config
>>>>>> Host *
>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>> IdentitiesOnly yes
>>>>>> BatchMode yes
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ll config
>>>>>> -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# chmod 600 config
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # show I can go to inst B without password/passphrase
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>>>>>> Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-E6-71
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# nslookup `hostname`
>>>>>> Server: 172.16.0.23
>>>>>> Address: 172.16.0.23#53
>>>>>>
>>>>>> Non-authoritative answer:
>>>>>> Name: domU-12-31-39-09-E6-71.compute-1.internal
>>>>>> Address: 10.210.233.123
>>>>>>
>>>>>> -bash-3.2# # and back to inst A is also no problem
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>>>>>> Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # log out twice to go back to inst A
>>>>>> -bash-3.2# exit
>>>>>> logout
>>>>>> Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# exit
>>>>>> logout
>>>>>> Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cd ..
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# pwd
>>>>>> /root
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ll
>>>>>> total 8
>>>>>> -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>>>>>> -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cat app.ac
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # when there is a remote machine (bottom 2 lines) it hangs
>>>>>> -bash-3.2# mpirun -app app.ac
>>>>>> mpirun: killing job...
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>> the "orte-clean" tool for assistance.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>>>>>> report back when launched
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cat app.ac2
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # when there is no remote machine, then mpirun works:
>>>>>> -bash-3.2# mpirun -app app.ac2
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # this gotta be ssh problem....
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # show no firewall is used
>>>>>> -bash-3.2# iptables --list
>>>>>> Chain INPUT (policy ACCEPT)
>>>>>> target prot opt source destination
>>>>>>
>>>>>> Chain FORWARD (policy ACCEPT)
>>>>>> target prot opt source destination
>>>>>>
>>>>>> Chain OUTPUT (policy ACCEPT)
>>>>>> target prot opt source destination
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# exit
>>>>>> logout
>>>>>> [tsakai_at_vixen ec2]$
>>>>>>
>>>>>> Would someone please point out what I am doing wrong?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tena
>>>>>>
>>>>> Hi Tena
>>>>>
>>>>> Nothing wrong that I can see.
>>>>> Just another couple of suggestions,
>>>>> based on somewhat vague possibilities.
>>>>>
>>>>> A slight difference is that on vixen and dasher you ran the
>>>>> MPI hostname tests as a regular user, not as root, right?
>>>>> Not sure if this will make much of a difference,
>>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>>> In general most people avoid running user applications (MPI programs
>>>>> included) as root.
>>>>> Mostly for safety, but I wonder if there are any
>>>>> implications in the 'rootly powers'
>>>>> regarding the under-the-hood processes that OpenMPI
>>>>> launches along with the actual user programs.
>>>>>
>>>>> This may make no difference either,
>>>>> but you could do a 'service iptables status',
>>>>> to see if the service is running, even though there are
>>>>> no explicit iptables rules (as per your email).
>>>>> If the service is not running you get
>>>>> 'Firewall is stopped.' (in CentOS).
>>>>> I *think* 'iptables --list' loads the iptables module into the
>>>>> kernel, as a side effect, whereas the service command does not.
>>>>> So, it may be cleaner (safer?) to use the service version
>>>>> instead of 'iptables --list'.
>>>>> I don't know if it will make any difference,
>>>>> but just in case, if the service is running,
>>>>> why not do 'service iptables stop',
>>>>> and perhaps also 'chkconfig iptables off' to be completely
>>>>> free of iptables?
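>>>>>
>>>>> I.e., something like this (stock CentOS service management; a
>>>>> sketch, adjust to whatever your AMI actually uses):
>>>>>
>>>>>   sudo service iptables status   # prints 'Firewall is stopped.' when off
>>>>>   sudo service iptables stop     # stop it for the current boot
>>>>>   sudo chkconfig iptables off    # keep it from starting at reboot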
>>>>>
>>>>> Gus Correa
>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users