Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] This must be ssh problem, but I can't figure out what it is...
From: Gus Correa (gus_at_[hidden])
Date: 2011-02-16 19:31:13


Hi Tena

Again, I think your EC2 session log with ssh debug3 level (below)
should be looked at by somebody more knowledgeable in OpenMPI
and in ssh than me.
There must be some clue in there as to what is going on.

Ssh experts, Jeff, Ralph, please help!

Anyway ...
AFAIK, 'orted', in the first line you selected/highlighted below,
is the OpenMPI Run-Time Environment daemon ( ... the OpenMPI pros
are authorized to send me to the galleys if it is not ...).
So, orted is trying to do its thing: to create the conditions for your
job to run across the two EC2 'instances'. (Gone are the naive
days when these things were computers, each one in its own box ...)
This master-of-ceremonies work of orted is done via tcp, and I guess
10.96.118.236 is the IP of instance A (where mpirun runs; instance B
shows up as 10.96.77.182 earlier in your log), and 56064 is probably
the port where the orted on instance B is expected to connect back.
The bunch of -mca parameters are just what they are: MCA parameters
(MCA = Modular Component Architecture of OpenMPI, and here I am risking
being shanghaied or ridiculed again ...).
(You can learn more about the mca parameters with 'ompi_info -help'.)
That is how, in my ignorance, I parse that line.
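
For instance, something along these lines lists them (a rough sketch;
the exact option names may differ a bit across OpenMPI versions):

   # dump every MCA parameter OpenMPI knows about
   ompi_info --param all all
   # or just the rsh/ssh launcher parameters that orted rides on
   ompi_info --param plm rsh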

So, from the computer/instance-A side orted gives the first kick,
but somehow the ball never comes back from computer/instance-B.
It's ping- without -pong.
The same frustrating feeling I had when I was a kid and kicked the
soccer ball on the neighbor's side and would never see it again.
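
If you want to test that leg by hand, a rough sketch would be the
following (untested on EC2; the IP and port are just the ones from your
log, and the port changes on every mpirun invocation):

   # from instance B, see whether mpirun's callback port on A is reachable
   telnet 10.96.118.236 56064      # or 'nc -vz 10.96.118.236 56064'
   # and check that orted resolves on B when launched non-interactively
   ssh domU-12-31-39-16-4E-4C.compute-1.internal 'which orted; echo $LD_LIBRARY_PATH'

If that connection cannot be made, something between the two instances
is blocking the port orted needs.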

Cheers,
Gus

Tena Sakai wrote:
> Hi Gus,
>
> Thank you for your reply and suggestions.
>
> I will follow up on these in a bit and will give you an
> update. Looking at what vixen and/or dasher generates
> from DEBUG3 would be interesting.
>
> For now, may I point out something I noticed in the
> DEBUG3 output last night?
>
> I found this line:
>
>> debug1: Sending command: orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>
> Followed by:
>
>> debug2: channel 0: request exec confirm 1
>> debug2: fd 3 setting TCP_NODELAY
>> debug2: callback done
>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>> debug3: Wrote 272 bytes for a total of 1893
>> debug2: channel 0: rcvd adjust 2097152
>> debug2: channel_input_status_confirm: type 99 id 0
>
> It appears, to my untrained eye/mind, that a directive from instance A
> to B was issued, and then what happened? I don't see that it was
> honored by instance B.
>
> Can you please comment on this?
>
> Thank you.
>
> Regards,
>
> Tena
>
> On 2/16/11 1:34 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>
>> Hi Tena
>>
>> I hope somebody more knowledgeable in ssh
>> takes a look at the debug3 session log that you included.
>>
>> I can't see if/where/why ssh is failing for you in EC2.
>>
>> See other answers inline, please.
>>
>> Tena Sakai wrote:
>>> Hi Gus,
>>>
>>> Thank you again for your reply.
>>>
>>>> A slight difference is that on vixen and dasher you ran the
>>>> MPI hostname tests as a regular user, not as root, right?
>>>> Not sure if this will make much of a difference,
>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>> In general, most people avoid running user applications (MPI programs
>>>> included) as root.
>>>> Mostly for safety, but I wonder if there are any
>>>> implications in the 'rootly powers'
>>>> regarding the under-the-hood processes that OpenMPI
>>>> launches along with the actual user programs.
>>> Yes, between vixen and dasher I was doing the test as user tsakai,
>>> not as root. But the reason I wanted to do this test as root is
>>> to show that it fails as regular user (generating pipe system
>>> call failed error), whereas as root it would succeed, as it did
>>> on Friday.
>> Sorry again.
>> I even wrote "root can and Tena cannot", then I forgot.
>> Too many tasks at the same time, too much context-switching ...
>>
>>> The ami has not changed. The last change on the ami
>>> was last Tuesday. As such I don't understand this inconsistent
>>> behavior. I have lots of notes from previous sessions and I
>>> consulted different successful session logs to replicate what I
>>> saw Friday, but with no success.
>>>
>>> Having spent days and not getting anywhere, I decided to take a
>>> different approach. I instantiated a linux ami that was built by
>>> Amazon, which feels like centos/fedora-based. I downloaded gcc
>>> and c++, plus openMPI 1.4.3. After I got openMPI running, I
>>> created an account for user tsakai, uploaded my public key, re-logged
>>> in as user tsakai, and ran the same test. Surprisingly (or not?) it
>>> generated the same result. I.e., I cannot run the same mpirun
>>> command when there is a remote instance involved, but on itself
>>> mpirun runs fine. So, I am feeling that this has to be an ssh
>>> authentication problem. I looked at the man pages for ssh and
>>> ssh_config and cannot figure out what I am doing wrong. I put in a
>>> "LogLevel DEBUG3" line and it generated lots of lines, in which I
>>> found this line:
>>> debug1: Authentication succeeded (publickey).
>>> Then I see a bunch of lines that look like:
>>> debug3: Ignored env XXXXXXX
>>> and mpirun hangs. Here is the session log:
>>>
>> Ssh on our clusters uses host-based authentication.
>> I think Reuti sent you his page about it:
>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>
>> However, I believe OpenMPI shouldn't care which ssh authentication
>> mechanism is used, as long as it works passwordless.
>>
>> As for ssh configuration, ours is pretty standard (a consolidated
>> sketch follows item 7 below):
>>
>> 1) We don't have 'IdentitiesOnly yes' (default is 'no'),
>> but use standard identity file names id_rsa, etc.
>> I think you are just telling ssh to use the specific identity
>> file you named.
>> I don't know if this may cause the problem, but who knows?
>>
>> 2) We don't have 'BatchMode yes' set.
>>
>> 3) We have GSSAPI authentication set:
>>
>> GSSAPIAuthentication yes
>>
>> 4) The locale environment variables are also passed
>> (may not be crucial):
>>
>> SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY
>> LC_MESSAGES
>> SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
>> SendEnv LC_IDENTIFICATION LC_ALL
>>
>> 5) And X forwarding (you're not doing any X stuff, I suppose):
>>
>> ForwardX11Trusted yes
>>
>> 6) However, you may want to check what is in your
>> /etc/ssh/ssh_config and /etc/ssh/sshd_config,
>> because some options may be already set there.
>>
>> 7) Take a look at 'man ssh[d]' and 'man ssh[d]_config' too.
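>>
>> Putting items 2) to 5) together, the client side boils down to roughly
>> this (a sketch of how those pieces would sit in an ssh_config here;
>> I am not saying these exact lines are what EC2 needs):
>>
>>    Host *
>>        GSSAPIAuthentication yes
>>        ForwardX11Trusted yes
>>        SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
>>        SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
>>        SendEnv LC_IDENTIFICATION LC_ALL
>>        # no IdentitiesOnly, no BatchMode - those are left at their defaults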
>>
>> ***
>>
>> Finally, if you are willing to, it may be worth running the same
>> experiment (with debug3) on vixen and dasher, just to compare what
>> comes out of the verbose ssh messages with what you see in EC2.
>> Perhaps it will help nail down the reason for the failure.
>>
>> Gus Correa
>>
>>
>>
>>> [tsakai_at_vixen ec2]$
>>> [tsakai_at_vixen ec2]$ ssh -i $MYKEY
>>> tsakai_at_[hidden]
>>> Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
>>>
>>> __| __|_ ) Amazon Linux AMI
>>> _| ( / Beta
>>> ___|\___|___|
>>>
>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>> :-)
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # show firewall is off
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ service iptables status
>>> -bash: service: command not found
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ sudo service iptables status
>>> iptables: Firewall is not running.
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no
>>> password authentication
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ ssh
>>> domU-12-31-39-16-4E-4C.compute-1.internal
>>> Last login: Wed Feb 16 06:53:14 2011 from
>>> domu-12-31-39-16-75-1e.compute-1.internal
>>>
>>> __| __|_ ) Amazon Linux AMI
>>> _| ( / Beta
>>> ___|\___|___|
>>>
>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>> :-)
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # also back to inst A
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ ssh
>>> domU-12-31-39-16-75-1E.compute-1.internal
>>> Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
>>>
>>> __| __|_ ) Amazon Linux AMI
>>> _| ( / Beta
>>> ___|\___|___|
>>>
>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>>> :-)
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # OK
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # back to inst B
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ exit
>>> logout
>>> Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
>>> LD_LIBRARY_PATH=:/usr/local/lib
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
>>> iptables: Firewall is not running.
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # go back to inst A
>>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ exit
>>> logout
>>> Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
>>> LD_LIBRARY_PATH=:/usr/local/lib
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cat app.ac
>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine);
>>> bottom 2 are remote inst (inst B)
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>> ^Cmpirun: killing job...
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --------------------------------------------------------------------------
>>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>> back when launched
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when
>>> launched ***
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cat app.ac2
>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
>>> domU-12-31-39-16-75-1E
>>> domU-12-31-39-16-75-1E
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # that's no problem
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cd .ssh
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>>> Host *
>>> IdentityFile /home/tsakai/.ssh/tsakai
>>> IdentitiesOnly yes
>>> BatchMode yes
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ ll config
>>> -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>>> Host *
>>> IdentityFile /home/tsakai/.ssh/tsakai
>>> IdentitiesOnly yes
>>> BatchMode yes
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat - >> config
>>> LogLevel DEBUG3
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>>> Host *
>>> IdentityFile /home/tsakai/.ssh/tsakai
>>> IdentitiesOnly yes
>>> BatchMode yes
>>> LogLevel DEBUG3
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ ll config
>>> -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cd ..
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>> debug2: ssh_connect: needpriv 0
>>> debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal
>>> [10.96.77.182] port 22.
>>> debug1: Connection established.
>>> debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>> debug3: key_read: missing keytype
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug3: key_read: missing whitespace
>>> debug2: key_type_from_name: unknown key type '-----END'
>>> debug3: key_read: missing keytype
>>> debug1: identity file /home/tsakai/.ssh/tsakai type -1
>>> debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
>>> debug1: match: OpenSSH_5.3 pat OpenSSH*
>>> debug1: Enabling compatibility mode for protocol 2.0
>>> debug1: Local version string SSH-2.0-OpenSSH_5.3
>>> debug2: fd 3 setting O_NONBLOCK
>>> debug1: SSH2_MSG_KEXINIT sent
>>> debug3: Wrote 792 bytes for a total of 813
>>> debug1: SSH2_MSG_KEXINIT received
>>> debug2: kex_parse_kexinit:
>>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
>>> ie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>> debug2: kex_parse_kexinit:
>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>>> iu.se
>>> debug2: kex_parse_kexinit:
>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>>> iu.se
>>> debug2: kex_parse_kexinit:
>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>>> .com,hmac-sha1-96,hmac-md5-96
>>> debug2: kex_parse_kexinit:
>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>>> .com,hmac-sha1-96,hmac-md5-96
>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden],zlib
>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden],zlib
>>> debug2: kex_parse_kexinit:
>>> debug2: kex_parse_kexinit:
>>> debug2: kex_parse_kexinit: first_kex_follows 0
>>> debug2: kex_parse_kexinit: reserved 0
>>> debug2: kex_parse_kexinit:
>>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
>>> ie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>> debug2: kex_parse_kexinit:
>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>>> iu.se
>>> debug2: kex_parse_kexinit:
>>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>>> iu.se
>>> debug2: kex_parse_kexinit:
>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>>> .com,hmac-sha1-96,hmac-md5-96
>>> debug2: kex_parse_kexinit:
>>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>>> .com,hmac-sha1-96,hmac-md5-96
>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden]
>>> debug2: kex_parse_kexinit: none,zlib_at_[hidden]
>>> debug2: kex_parse_kexinit:
>>> debug2: kex_parse_kexinit:
>>> debug2: kex_parse_kexinit: first_kex_follows 0
>>> debug2: kex_parse_kexinit: reserved 0
>>> debug2: mac_setup: found hmac-md5
>>> debug1: kex: server->client aes128-ctr hmac-md5 none
>>> debug2: mac_setup: found hmac-md5
>>> debug1: kex: client->server aes128-ctr hmac-md5 none
>>> debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
>>> debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
>>> debug3: Wrote 24 bytes for a total of 837
>>> debug2: dh_gen_key: priv key bits set: 125/256
>>> debug2: bits set: 489/1024
>>> debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
>>> debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
>>> debug3: Wrote 144 bytes for a total of 981
>>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>> debug3: check_host_in_hostfile: match line 1
>>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>> debug3: check_host_in_hostfile: match line 1
>>> debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and
>>> matches the RSA host key.
>>> debug1: Found key in /home/tsakai/.ssh/known_hosts:1
>>> debug2: bits set: 491/1024
>>> debug1: ssh_rsa_verify: signature correct
>>> debug2: kex_derive_keys
>>> debug2: set_newkeys: mode 1
>>> debug1: SSH2_MSG_NEWKEYS sent
>>> debug1: expecting SSH2_MSG_NEWKEYS
>>> debug3: Wrote 16 bytes for a total of 997
>>> debug2: set_newkeys: mode 0
>>> debug1: SSH2_MSG_NEWKEYS received
>>> debug1: SSH2_MSG_SERVICE_REQUEST sent
>>> debug3: Wrote 48 bytes for a total of 1045
>>> debug2: service_accept: ssh-userauth
>>> debug1: SSH2_MSG_SERVICE_ACCEPT received
>>> debug2: key: /home/tsakai/.ssh/tsakai ((nil))
>>> debug3: Wrote 64 bytes for a total of 1109
>>> debug1: Authentications that can continue: publickey
>>> debug3: start over, passed a different list publickey
>>> debug3: preferred gssapi-with-mic,publickey
>>> debug3: authmethod_lookup publickey
>>> debug3: remaining preferred: ,publickey
>>> debug3: authmethod_is_enabled publickey
>>> debug1: Next authentication method: publickey
>>> debug1: Trying private key: /home/tsakai/.ssh/tsakai
>>> debug1: read PEM private key done: type RSA
>>> debug3: sign_and_send_pubkey
>>> debug2: we sent a publickey packet, wait for reply
>>> debug3: Wrote 384 bytes for a total of 1493
>>> debug1: Authentication succeeded (publickey).
>>> debug2: fd 4 setting O_NONBLOCK
>>> debug1: channel 0: new [client-session]
>>> debug3: ssh_session2_open: channel_new: 0
>>> debug2: channel 0: send open
>>> debug1: Requesting no-more-sessions_at_[hidden]
>>> debug1: Entering interactive session.
>>> debug3: Wrote 128 bytes for a total of 1621
>>> debug2: callback start
>>> debug2: client_session2_setup: id 0
>>> debug1: Sending environment.
>>> debug3: Ignored env HOSTNAME
>>> debug3: Ignored env TERM
>>> debug3: Ignored env SHELL
>>> debug3: Ignored env HISTSIZE
>>> debug3: Ignored env EC2_AMITOOL_HOME
>>> debug3: Ignored env SSH_CLIENT
>>> debug3: Ignored env SSH_TTY
>>> debug3: Ignored env USER
>>> debug3: Ignored env LD_LIBRARY_PATH
>>> debug3: Ignored env LS_COLORS
>>> debug3: Ignored env EC2_HOME
>>> debug3: Ignored env MAIL
>>> debug3: Ignored env PATH
>>> debug3: Ignored env INPUTRC
>>> debug3: Ignored env PWD
>>> debug3: Ignored env JAVA_HOME
>>> debug1: Sending env LANG = en_US.UTF-8
>>> debug2: channel 0: request env confirm 0
>>> debug3: Ignored env AWS_CLOUDWATCH_HOME
>>> debug3: Ignored env AWS_IAM_HOME
>>> debug3: Ignored env SHLVL
>>> debug3: Ignored env HOME
>>> debug3: Ignored env AWS_PATH
>>> debug3: Ignored env AWS_AUTO_SCALING_HOME
>>> debug3: Ignored env LOGNAME
>>> debug3: Ignored env AWS_ELB_HOME
>>> debug3: Ignored env SSH_CONNECTION
>>> debug3: Ignored env LESSOPEN
>>> debug3: Ignored env AWS_RDS_HOME
>>> debug3: Ignored env G_BROKEN_FILENAMES
>>> debug3: Ignored env _
>>> debug3: Ignored env OLDPWD
>>> debug3: Ignored env OMPI_MCA_plm
>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>> debug2: channel 0: request exec confirm 1
>>> debug2: fd 3 setting TCP_NODELAY
>>> debug2: callback done
>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>> debug3: Wrote 272 bytes for a total of 1893
>>> debug2: channel 0: rcvd adjust 2097152
>>> debug2: channel_input_status_confirm: type 99 id 0
>>> debug2: exec request accepted on channel 0
>>> debug2: channel 0: read<=0 rfd 4 len 0
>>> debug2: channel 0: read failed
>>> debug2: channel 0: close_read
>>> debug2: channel 0: input open -> drain
>>> debug2: channel 0: ibuf empty
>>> debug2: channel 0: send eof
>>> debug2: channel 0: input drain -> closed
>>> debug3: Wrote 32 bytes for a total of 1925
>>> debug2: channel 0: rcvd eof
>>> debug2: channel 0: output open -> drain
>>> debug2: channel 0: obuf empty
>>> debug2: channel 0: close_write
>>> debug2: channel 0: output drain -> closed
>>> debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
>>> debug2: channel 0: rcvd close
>>> debug3: channel 0: will not send data after close
>>> debug2: channel 0: almost dead
>>> debug2: channel 0: gc: notify user
>>> debug2: channel 0: gc: user detached
>>> debug2: channel 0: send close
>>> debug2: channel 0: is dead
>>> debug2: channel 0: garbage collecting
>>> debug1: channel 0: free: client-session, nchannels 1
>>> debug3: channel 0: status: The following connections are open:
>>> #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
>>>
>>> debug3: channel 0: close_fds r -1 w -1 e 6 c -1
>>> debug3: Wrote 32 bytes for a total of 1957
>>> debug3: Wrote 64 bytes for a total of 2021
>>> debug1: fd 0 clearing O_NONBLOCK
>>> Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
>>> Bytes per second: sent 18384.8, received 18944.3
>>> debug1: Exit status 0
>>> # it is hanging; I am about to issue control-C
>>> ^Cmpirun: killing job...
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --------------------------------------------------------------------------
>>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>> back when launched
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # daemon did not report back when
>>> launched
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # what does that mean?
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # I give up
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ exit
>>> logout
>>> [tsakai_at_vixen ec2]$
>>> [tsakai_at_vixen ec2]$
>>>
>>> Do you see anything strange?
>>>
>>> One final question: the ssh man page mentions a few environment
>>> variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc. Do
>>> any of these matter as far as openMPI is concerned?
>>>
>>> Thank you, Gus.
>>>
>>> Regards,
>>>
>>> Tena
>>>
>>> On 2/15/11 5:09 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>>
>>>> Tena Sakai wrote:
>>>>> Hi,
>>>>>
>>>>> I am trying to reproduce what I was able to show last Friday on Amazon
>>>>> EC2 instances, but I am having a problem. What I was able to show last
>>>>> Friday as root was with this command:
>>>>> mpirun -app app.ac
>>>>> with app.ac being:
>>>>> -H dns-entry-A -np 1 (linux command)
>>>>> -H dns-entry-A -np 1 (linux command)
>>>>> -H dns-entry-B -np 1 (linux command)
>>>>> -H dns-entry-B -np 1 (linux command)
>>>>>
>>>>> Here's the config file in root's .ssh directory:
>>>>> Host *
>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>> IdentitiesOnly yes
>>>>> BatchMode yes
>>>>>
>>>>> Yesterday and today I can't get this to work. I made the last part of
>>>>> app.ac
>>>>> file simpler (it now says /bin/hostname). Below is the session:
>>>>>
>>>>> -bash-3.2#
>>>>> -bash-3.2# # I am on instance A, host name for inst A is:
>>>>> -bash-3.2# hostname
>>>>> domU-12-31-39-09-CD-C2
>>>>> -bash-3.2#
>>>>> -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>>>>> Server: 172.16.0.23
>>>>> Address: 172.16.0.23#53
>>>>>
>>>>> Non-authoritative answer:
>>>>> Name: domU-12-31-39-09-CD-C2.compute-1.internal
>>>>> Address: 10.210.210.48
>>>>>
>>>>> -bash-3.2# cd .ssh
>>>>> -bash-3.2#
>>>>> -bash-3.2# cat config
>>>>> Host *
>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>> IdentitiesOnly yes
>>>>> BatchMode yes
>>>>> -bash-3.2#
>>>>> -bash-3.2# ll config
>>>>> -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>>>>> -bash-3.2#
>>>>> -bash-3.2# chmod 600 config
>>>>> -bash-3.2#
>>>>> -bash-3.2# # show I can go to inst B without password/passphrase
>>>>> -bash-3.2#
>>>>> -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>>>>> Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>>>>> -bash-3.2#
>>>>> -bash-3.2# hostname
>>>>> domU-12-31-39-09-E6-71
>>>>> -bash-3.2#
>>>>> -bash-3.2# nslookup `hostname`
>>>>> Server: 172.16.0.23
>>>>> Address: 172.16.0.23#53
>>>>>
>>>>> Non-authoritative answer:
>>>>> Name: domU-12-31-39-09-E6-71.compute-1.internal
>>>>> Address: 10.210.233.123
>>>>>
>>>>> -bash-3.2# # and back to inst A is also no problem
>>>>> -bash-3.2#
>>>>> -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>>>>> Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>>>>> -bash-3.2#
>>>>> -bash-3.2# hostname
>>>>> domU-12-31-39-09-CD-C2
>>>>> -bash-3.2#
>>>>> -bash-3.2# # log out twice to go back to inst A
>>>>> -bash-3.2# exit
>>>>> logout
>>>>> Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>>>>> -bash-3.2#
>>>>> -bash-3.2# exit
>>>>> logout
>>>>> Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>>>>> -bash-3.2#
>>>>> -bash-3.2# hostname
>>>>> domU-12-31-39-09-CD-C2
>>>>> -bash-3.2#
>>>>> -bash-3.2# cd ..
>>>>> -bash-3.2#
>>>>> -bash-3.2# pwd
>>>>> /root
>>>>> -bash-3.2#
>>>>> -bash-3.2# ll
>>>>> total 8
>>>>> -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>>>>> -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>>>>> -bash-3.2#
>>>>> -bash-3.2# cat app.ac
>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>> -bash-3.2#
>>>>> -bash-3.2# # when there is a remote machine (bottome 2 lines) it hangs
>>>>> -bash-3.2# mpirun -app app.ac
>>>>> mpirun: killing job...
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>> below. Additional manual cleanup may be required - please refer to
>>>>> the "orte-clean" tool for assistance.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>>>>> report back when launched
>>>>> -bash-3.2#
>>>>> -bash-3.2# cat app.ac2
>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>> -bash-3.2#
>>>>> -bash-3.2# # when there is no remote machine, then mpirun works:
>>>>> -bash-3.2# mpirun -app app.ac2
>>>>> domU-12-31-39-09-CD-C2
>>>>> domU-12-31-39-09-CD-C2
>>>>> -bash-3.2#
>>>>> -bash-3.2# hostname
>>>>> domU-12-31-39-09-CD-C2
>>>>> -bash-3.2#
>>>>> -bash-3.2# # this gotta be ssh problem....
>>>>> -bash-3.2#
>>>>> -bash-3.2# # show no firewall is used
>>>>> -bash-3.2# iptables --list
>>>>> Chain INPUT (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> Chain FORWARD (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> Chain OUTPUT (policy ACCEPT)
>>>>> target prot opt source destination
>>>>> -bash-3.2#
>>>>> -bash-3.2# exit
>>>>> logout
>>>>> [tsakai_at_vixen ec2]$
>>>>>
>>>>> Would someone please point out what I am doing wrong?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tena
>>>>>
>>>> Hi Tena
>>>>
>>>> Nothing wrong that I can see.
>>>> Just another couple of suggestions,
>>>> based on somewhat vague possibilities.
>>>>
>>>> A slight difference is that on vixen and dasher you ran the
>>>> MPI hostname tests as a regular user, not as root, right?
>>>> Not sure if this will make much of a difference,
>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>> In general, most people avoid running user applications (MPI programs
>>>> included) as root.
>>>> Mostly for safety, but I wonder if there are any
>>>> implications in the 'rootly powers'
>>>> regarding the under-the-hood processes that OpenMPI
>>>> launches along with the actual user programs.
>>>>
>>>> This may make no difference either,
>>>> but you could do a 'service iptables status',
>>>> to see if the service is running, even though there are
>>>> no explicit iptable rules (as per your email).
>>>> If the service is not running you get
>>>> 'Firewall is stopped.' (in CentOS).
>>>> I *think* 'iptables --list' loads the iptables module into the
>>>> kernel, as a side effect, whereas the service command does not.
>>>> So, it may be cleaner (safer?) to use the service version
>>>> instead of 'iptables --list'.
>>>> I don't know if it will make any difference,
>>>> but just in case, if the service is running,
>>>> why not do 'service iptables stop',
>>>> and perhaps also 'chkconfig iptables off' to be completely
>>>> free of iptables?
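>>>>
>>>> In commands, that would be roughly (untested on the Amazon AMI, and
>>>> only meaningful where the 'service'/'chkconfig' tools exist):
>>>>
>>>>    service iptables status    # says 'Firewall is stopped.' when off
>>>>    service iptables stop      # stop it for the current boot
>>>>    chkconfig iptables off     # keep it from coming back at reboot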
>>>>
>>>> Gus Correa
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users