Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] This must be ssh problem, but I can't figure out what it is...
From: Tena Sakai (tsakai_at_[hidden])
Date: 2011-02-16 18:17:34


Hi Gus,

Thank you for your reply and suggestions.

I will follow up on these in a bit and will give you an
update. Looking at what vixen and/or dasher generate
under DEBUG3 would be interesting.

For now, may I point out something I noticed in the
DEBUG3 output last night?

I found this line:

> debug1: Sending command: orted --daemonize -mca ess env -mca
> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"

Followed by:

> debug2: channel 0: request exec confirm 1
> debug2: fd 3 setting TCP_NODELAY
> debug2: callback done
> debug2: channel 0: open confirm rwindow 0 rmax 32768
> debug3: Wrote 272 bytes for a total of 1893
> debug2: channel 0: rcvd adjust 2097152
> debug2: channel_input_status_confirm: type 99 id 0

It appears, to my untrained eye, that a directive from instance A
to instance B was issued, and then what happened? I don't see any
sign that it was honored by instance B.
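
(One check that occurs to me, in case it helps: whether instance B
can even find orted, and the right libraries, over a non-interactive
ssh session, e.g.

    ssh domU-12-31-39-16-4E-4C.compute-1.internal which orted
    ssh domU-12-31-39-16-4E-4C.compute-1.internal 'echo $LD_LIBRARY_PATH'

since a non-interactive shell can pick up a different PATH and
LD_LIBRARY_PATH than a login shell does. I have not tried this yet.)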

Can you please comment on this?

Thank you.

Regards,

Tena

On 2/16/11 1:34 PM, "Gus Correa" <gus_at_[hidden]> wrote:

> Hi Tena
>
> I hope somebody more knowledgeable in ssh
> takes a look at the debug3 session log that you included.
>
> I can't see if/where/why ssh is failing for you in EC2.
>
> See other answers inline, please.
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you again for your reply.
>>
>>> A slight difference is that on vixen and dashen you ran the
>>> MPI hostname tests as a regular user, not as root, right?
>>> Not sure if this will make much of a difference,
>>> but it may be worth trying to run it as a regular user in EC2 also.
>>> In general, most people avoid running user applications (MPI programs
>>> included) as root.
>>> Mostly for safety, but I wonder if there are any
>>> implications in the 'rootly powers'
>>> regarding the under-the-hood processes that OpenMPI
>>> launches along with the actual user programs.
>>
>> Yes, between vixen and dasher I was doing the test as user tsakai,
>> not as root. But the reason I wanted to do this test as root is
>> to show that it fails as a regular user (generating a "pipe system
>> call failed" error), whereas as root it would succeed, as it did
>> on Friday.
>
> Sorry again.
> I even wrote "root can and Tena cannot", then I forgot.
> Too many tasks at the same time, too much context-switching ...
>
>> The AMI has not changed; the last change to it was made last
>> Tuesday. As such, I don't understand this inconsistent behavior.
>> I have lots of notes from previous sessions, and I consulted
>> several successful session logs to try to replicate what I saw
>> on Friday, but with no success.
>>
>> Having spent days and not gotten anywhere, I decided to take a
>> different approach. I instantiated a Linux AMI built by Amazon,
>> which feels CentOS/Fedora-based. I downloaded gcc and g++, plus
>> Open MPI 1.4.3. After I got Open MPI running, I created an account
>> for user tsakai, uploaded my public key, logged back in as user
>> tsakai, and ran the same test. Surprisingly (or not?), it produced
>> the same result: I cannot run the same mpirun command when a
>> remote instance is involved, but by itself each instance runs
>> mpirun fine. So I am feeling that this has to be an ssh
>> authentication problem. I looked at the man pages for ssh and
>> ssh_config and cannot figure out what I am doing wrong. I put in
>> a "LogLevel DEBUG3" line, and it generated lots of output, in
>> which I found this line:
>> debug1: Authentication succeeded (publickey).
>> Then I see a bunch of lines that look like:
>> debug3: Ignored env XXXXXXX
>> and mpirun hangs. Here is the session log:
>>
>
> Ssh on our clusters uses host-based authentication.
> I think Reuti sent you his page about it:
> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>
> However, I believe OpenMPI shouldn't care which ssh authentication
> mechanism is used, as long as it works passwordless.
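> (A quick sanity check: 'ssh -o BatchMode=yes <remote host> true'
> should exit immediately with status 0 when passwordless login works;
> with BatchMode set, ssh fails rather than prompting for a password.)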
>
> As for ssh configuration, ours is pretty standard:
>
> 1) We don't have 'IdentitiesOnly yes' (default is 'no'),
> but use standard identity file names id_rsa, etc.
> I think you are just telling ssh to use the specific identity
> file you named.
> I don't know whether this could cause the problem, but who knows?
>
> 2) We don't have 'BatchMode yes' set.
>
> 3) We have GSSAPI authentication set:
>
> GSSAPIAuthentication yes
>
> 4) The locale environment variables are also passed
> (may not be crucial):
>
> SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY
> LC_MESSAGES
> SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
> SendEnv LC_IDENTIFICATION LC_ALL
>
> 5) And X forwarding (you're not doing any X stuff, I suppose):
>
> ForwardX11Trusted yes
>
> 6) However, you may want to check what is in your
> /etc/ssh/ssh_config and /etc/ssh/sshd_config,
> because some options may already be set there.
>
> 7) Take a look at 'man ssh[d]' and 'man ssh[d]_config' too.
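>
> For concreteness, a minimal ~/.ssh/config in the spirit of items
> 3-5 might look like this (an illustrative sketch, not our exact
> file):
>
>    Host *
>        GSSAPIAuthentication yes
>        ForwardX11Trusted yes
>        SendEnv LANG LC_*
>
> with identity files left at their default names (id_rsa, etc.),
> per item 1.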
>
> ***
>
> Finally, if you are willing, it may be worth running the same
> experiment (with DEBUG3) on vixen and dashen, just to compare the
> verbose ssh messages there with what you see in EC2.
> Perhaps that will help nail down the reason for the failure.
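>
> (A shortcut for the plain-ssh part of that comparison:
> 'ssh -vvv dashen /bin/hostname' from vixen gives the same
> DEBUG3-level output without editing ~/.ssh/config; three -v's
> correspond to LogLevel DEBUG3.)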
>
> Gus Correa
>
>
>
>> [tsakai_at_vixen ec2]$
>> [tsakai_at_vixen ec2]$ ssh -i $MYKEY
>> tsakai_at_[hidden]
>> Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
>>
>> __| __|_ ) Amazon Linux AMI
>> _| ( / Beta
>> ___|\___|___|
>>
>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # show firewall is off
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ service iptables status
>> -bash: service: command not found
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ sudo service iptables status
>> iptables: Firewall is not running.
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no
>> password authentication
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ ssh
>> domU-12-31-39-16-4E-4C.compute-1.internal
>> Last login: Wed Feb 16 06:53:14 2011 from
>> domu-12-31-39-16-75-1e.compute-1.internal
>>
>> __| __|_ ) Amazon Linux AMI
>> _| ( / Beta
>> ___|\___|___|
>>
>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # also back to inst A
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ ssh
>> domU-12-31-39-16-75-1E.compute-1.internal
>> Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
>>
>> __| __|_ ) Amazon Linux AMI
>> _| ( / Beta
>> ___|\___|___|
>>
>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # OK
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # back to inst B
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ exit
>> logout
>> Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
>> LD_LIBRARY_PATH=:/usr/local/lib
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
>> iptables: Firewall is not running.
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ # go back to inst A
>> [tsakai_at_domU-12-31-39-16-4E-4C ~]$ exit
>> logout
>> Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
>> LD_LIBRARY_PATH=:/usr/local/lib
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cat app.ac
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine);
>> bottom 2 are remote inst (inst B)
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>> ^Cmpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>> back when launched
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when
>> launched ***
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cat app.ac2
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
>> domU-12-31-39-16-75-1E
>> domU-12-31-39-16-75-1E
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # that's no problem
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ cd .ssh
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>> Host *
>> IdentityFile /home/tsakai/.ssh/tsakai
>> IdentitiesOnly yes
>> BatchMode yes
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ ll config
>> -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>> Host *
>> IdentityFile /home/tsakai/.ssh/tsakai
>> IdentitiesOnly yes
>> BatchMode yes
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat - >> config
>> LogLevel DEBUG3
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cat config
>> Host *
>> IdentityFile /home/tsakai/.ssh/tsakai
>> IdentitiesOnly yes
>> BatchMode yes
>> LogLevel DEBUG3
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ ll config
>> -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$
>> [tsakai_at_domU-12-31-39-16-75-1E .ssh]$ cd ..
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>> debug2: ssh_connect: needpriv 0
>> debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal
>> [10.96.77.182] port 22.
>> debug1: Connection established.
>> debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>> debug3: key_read: missing keytype
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug2: key_type_from_name: unknown key type '-----END'
>> debug3: key_read: missing keytype
>> debug1: identity file /home/tsakai/.ssh/tsakai type -1
>> debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
>> debug1: match: OpenSSH_5.3 pat OpenSSH*
>> debug1: Enabling compatibility mode for protocol 2.0
>> debug1: Local version string SSH-2.0-OpenSSH_5.3
>> debug2: fd 3 setting O_NONBLOCK
>> debug1: SSH2_MSG_KEXINIT sent
>> debug3: Wrote 792 bytes for a total of 813
>> debug1: SSH2_MSG_KEXINIT received
>> debug2: kex_parse_kexinit:
>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
>> ie-hellman-group14-sha1,diffie-hellman-group1-sha1
>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>> debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>> iu.se
>> debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>> iu.se
>> debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>> .com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>> .com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit: none,zlib_at_[hidden],zlib
>> debug2: kex_parse_kexinit: none,zlib_at_[hidden],zlib
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit: first_kex_follows 0
>> debug2: kex_parse_kexinit: reserved 0
>> debug2: kex_parse_kexinit:
>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
>> ie-hellman-group14-sha1,diffie-hellman-group1-sha1
>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>> debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>> iu.se
>> debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc_at_lysator.l
>> iu.se
>> debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>> .com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac-64_at_[hidden],hmac-ripemd160,hmac-ripemd160_at_openssh
>> .com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit: none,zlib_at_[hidden]
>> debug2: kex_parse_kexinit: none,zlib_at_[hidden]
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit: first_kex_follows 0
>> debug2: kex_parse_kexinit: reserved 0
>> debug2: mac_setup: found hmac-md5
>> debug1: kex: server->client aes128-ctr hmac-md5 none
>> debug2: mac_setup: found hmac-md5
>> debug1: kex: client->server aes128-ctr hmac-md5 none
>> debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
>> debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
>> debug3: Wrote 24 bytes for a total of 837
>> debug2: dh_gen_key: priv key bits set: 125/256
>> debug2: bits set: 489/1024
>> debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
>> debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
>> debug3: Wrote 144 bytes for a total of 981
>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>> debug3: check_host_in_hostfile: match line 1
>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>> debug3: check_host_in_hostfile: match line 1
>> debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and
>> matches the RSA host key.
>> debug1: Found key in /home/tsakai/.ssh/known_hosts:1
>> debug2: bits set: 491/1024
>> debug1: ssh_rsa_verify: signature correct
>> debug2: kex_derive_keys
>> debug2: set_newkeys: mode 1
>> debug1: SSH2_MSG_NEWKEYS sent
>> debug1: expecting SSH2_MSG_NEWKEYS
>> debug3: Wrote 16 bytes for a total of 997
>> debug2: set_newkeys: mode 0
>> debug1: SSH2_MSG_NEWKEYS received
>> debug1: SSH2_MSG_SERVICE_REQUEST sent
>> debug3: Wrote 48 bytes for a total of 1045
>> debug2: service_accept: ssh-userauth
>> debug1: SSH2_MSG_SERVICE_ACCEPT received
>> debug2: key: /home/tsakai/.ssh/tsakai ((nil))
>> debug3: Wrote 64 bytes for a total of 1109
>> debug1: Authentications that can continue: publickey
>> debug3: start over, passed a different list publickey
>> debug3: preferred gssapi-with-mic,publickey
>> debug3: authmethod_lookup publickey
>> debug3: remaining preferred: ,publickey
>> debug3: authmethod_is_enabled publickey
>> debug1: Next authentication method: publickey
>> debug1: Trying private key: /home/tsakai/.ssh/tsakai
>> debug1: read PEM private key done: type RSA
>> debug3: sign_and_send_pubkey
>> debug2: we sent a publickey packet, wait for reply
>> debug3: Wrote 384 bytes for a total of 1493
>> debug1: Authentication succeeded (publickey).
>> debug2: fd 4 setting O_NONBLOCK
>> debug1: channel 0: new [client-session]
>> debug3: ssh_session2_open: channel_new: 0
>> debug2: channel 0: send open
>> debug1: Requesting no-more-sessions_at_[hidden]
>> debug1: Entering interactive session.
>> debug3: Wrote 128 bytes for a total of 1621
>> debug2: callback start
>> debug2: client_session2_setup: id 0
>> debug1: Sending environment.
>> debug3: Ignored env HOSTNAME
>> debug3: Ignored env TERM
>> debug3: Ignored env SHELL
>> debug3: Ignored env HISTSIZE
>> debug3: Ignored env EC2_AMITOOL_HOME
>> debug3: Ignored env SSH_CLIENT
>> debug3: Ignored env SSH_TTY
>> debug3: Ignored env USER
>> debug3: Ignored env LD_LIBRARY_PATH
>> debug3: Ignored env LS_COLORS
>> debug3: Ignored env EC2_HOME
>> debug3: Ignored env MAIL
>> debug3: Ignored env PATH
>> debug3: Ignored env INPUTRC
>> debug3: Ignored env PWD
>> debug3: Ignored env JAVA_HOME
>> debug1: Sending env LANG = en_US.UTF-8
>> debug2: channel 0: request env confirm 0
>> debug3: Ignored env AWS_CLOUDWATCH_HOME
>> debug3: Ignored env AWS_IAM_HOME
>> debug3: Ignored env SHLVL
>> debug3: Ignored env HOME
>> debug3: Ignored env AWS_PATH
>> debug3: Ignored env AWS_AUTO_SCALING_HOME
>> debug3: Ignored env LOGNAME
>> debug3: Ignored env AWS_ELB_HOME
>> debug3: Ignored env SSH_CONNECTION
>> debug3: Ignored env LESSOPEN
>> debug3: Ignored env AWS_RDS_HOME
>> debug3: Ignored env G_BROKEN_FILENAMES
>> debug3: Ignored env _
>> debug3: Ignored env OLDPWD
>> debug3: Ignored env OMPI_MCA_plm
>> debug1: Sending command: orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>> debug2: channel 0: request exec confirm 1
>> debug2: fd 3 setting TCP_NODELAY
>> debug2: callback done
>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>> debug3: Wrote 272 bytes for a total of 1893
>> debug2: channel 0: rcvd adjust 2097152
>> debug2: channel_input_status_confirm: type 99 id 0
>> debug2: exec request accepted on channel 0
>> debug2: channel 0: read<=0 rfd 4 len 0
>> debug2: channel 0: read failed
>> debug2: channel 0: close_read
>> debug2: channel 0: input open -> drain
>> debug2: channel 0: ibuf empty
>> debug2: channel 0: send eof
>> debug2: channel 0: input drain -> closed
>> debug3: Wrote 32 bytes for a total of 1925
>> debug2: channel 0: rcvd eof
>> debug2: channel 0: output open -> drain
>> debug2: channel 0: obuf empty
>> debug2: channel 0: close_write
>> debug2: channel 0: output drain -> closed
>> debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
>> debug2: channel 0: rcvd close
>> debug3: channel 0: will not send data after close
>> debug2: channel 0: almost dead
>> debug2: channel 0: gc: notify user
>> debug2: channel 0: gc: user detached
>> debug2: channel 0: send close
>> debug2: channel 0: is dead
>> debug2: channel 0: garbage collecting
>> debug1: channel 0: free: client-session, nchannels 1
>> debug3: channel 0: status: The following connections are open:
>> #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
>>
>> debug3: channel 0: close_fds r -1 w -1 e 6 c -1
>> debug3: Wrote 32 bytes for a total of 1957
>> debug3: Wrote 64 bytes for a total of 2021
>> debug1: fd 0 clearing O_NONBLOCK
>> Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
>> Bytes per second: sent 18384.8, received 18944.3
>> debug1: Exit status 0
>> # it is hanging; I am about to issue control-C
>> ^Cmpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>> back when launched
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # daemon did not report back when
>> launched
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # what does that mean?
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ # I give up
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$
>> [tsakai_at_domU-12-31-39-16-75-1E ~]$ exit
>> logout
>> [tsakai_at_vixen ec2]$
>> [tsakai_at_vixen ec2]$
>>
>> Do you see anything strange?
>>
>> One final question: the ssh man page mentions a few environment
>> variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc. Do
>> any of these matter as far as Open MPI is concerned?
>>
>> Thank you, Gus.
>>
>> Regards,
>>
>> Tena
>>
>> On 2/15/11 5:09 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>>
>>> Tena Sakai wrote:
>>>> Hi,
>>>>
>>>> I am trying to reproduce what I was able to show last Friday on Amazon
>>>> EC2 instances, but I am having a problem. What I was able to show last
>>>> Friday as root was with this command:
>>>> mpirun -app app.ac
>>>> with app.ac being:
>>>> -H dns-entry-A -np 1 (linux command)
>>>> -H dns-entry-A -np 1 (linux command)
>>>> -H dns-entry-B -np 1 (linux command)
>>>> -H dns-entry-B -np 1 (linux command)
>>>>
>>>> Here's the config file in root's .ssh directory:
>>>> Host *
>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>>
>>>> Yesterday and today I can't get this to work. I made the last
>>>> part of the app.ac file simpler (it now says /bin/hostname).
>>>> Below is the session:
>>>>
>>>> -bash-3.2#
>>>> -bash-3.2# # I am on instance A, host name for inst A is:
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>>>> Server: 172.16.0.23
>>>> Address: 172.16.0.23#53
>>>>
>>>> Non-authoritative answer:
>>>> Name: domU-12-31-39-09-CD-C2.compute-1.internal
>>>> Address: 10.210.210.48
>>>>
>>>> -bash-3.2# cd .ssh
>>>> -bash-3.2#
>>>> -bash-3.2# cat config
>>>> Host *
>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> -bash-3.2#
>>>> -bash-3.2# ll config
>>>> -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>>>> -bash-3.2#
>>>> -bash-3.2# chmod 600 config
>>>> -bash-3.2#
>>>> -bash-3.2# # show I can go to inst B without password/passphrase
>>>> -bash-3.2#
>>>> -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>>>> Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-E6-71
>>>> -bash-3.2#
>>>> -bash-3.2# nslookup `hostname`
>>>> Server: 172.16.0.23
>>>> Address: 172.16.0.23#53
>>>>
>>>> Non-authoritative answer:
>>>> Name: domU-12-31-39-09-E6-71.compute-1.internal
>>>> Address: 10.210.233.123
>>>>
>>>> -bash-3.2# # and back to inst A is also no problem
>>>> -bash-3.2#
>>>> -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>>>> Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# # log out twice to go back to inst A
>>>> -bash-3.2# exit
>>>> logout
>>>> Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>>>> -bash-3.2#
>>>> -bash-3.2# exit
>>>> logout
>>>> Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# cd ..
>>>> -bash-3.2#
>>>> -bash-3.2# pwd
>>>> /root
>>>> -bash-3.2#
>>>> -bash-3.2# ll
>>>> total 8
>>>> -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>>>> -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>>>> -bash-3.2#
>>>> -bash-3.2# cat app.ac
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>> -bash-3.2#
>>>> -bash-3.2# # when there is a remote machine (bottom 2 lines) it hangs
>>>> -bash-3.2# mpirun -app app.ac
>>>> mpirun: killing job...
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>>
>>>> --------------------------------------------------------------------------
>>>> domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>>>> report back when launched
>>>> -bash-3.2#
>>>> -bash-3.2# cat app.ac2
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -bash-3.2#
>>>> -bash-3.2# # when there is no remote machine, then mpirun works:
>>>> -bash-3.2# mpirun -app app.ac2
>>>> domU-12-31-39-09-CD-C2
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# # this has gotta be an ssh problem....
>>>> -bash-3.2#
>>>> -bash-3.2# # show no firewall is used
>>>> -bash-3.2# iptables --list
>>>> Chain INPUT (policy ACCEPT)
>>>> target prot opt source destination
>>>>
>>>> Chain FORWARD (policy ACCEPT)
>>>> target prot opt source destination
>>>>
>>>> Chain OUTPUT (policy ACCEPT)
>>>> target prot opt source destination
>>>> -bash-3.2#
>>>> -bash-3.2# exit
>>>> logout
>>>> [tsakai_at_vixen ec2]$
>>>>
>>>> Would someone please point out what I am doing wrong?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>> Hi Tena
>>>
>>> Nothing wrong that I can see.
>>> Just another couple of suggestions,
>>> based on somewhat vague possibilities.
>>>
>>> A slight difference is that on vixen and dashen you ran the
>>> MPI hostname tests as a regular user, not as root, right?
>>> Not sure if this will make much of a difference,
>>> but it may be worth trying to run it as a regular user in EC2 also.
>>> In general, most people avoid running user applications (MPI programs
>>> included) as root.
>>> Mostly for safety, but I wonder if there are any
>>> implications in the 'rootly powers'
>>> regarding the under-the-hood processes that OpenMPI
>>> launches along with the actual user programs.
>>>
>>> This may make no difference either,
>>> but you could do a 'service iptables status',
>>> to see if the service is running, even though there are
>>> no explicit iptables rules (as per your email).
>>> If the service is not running you get
>>> 'Firewall is stopped.' (in CentOS).
>>> I *think* 'iptables --list' loads the iptables module into the
>>> kernel, as a side effect, whereas the service command does not.
>>> So, it may be cleaner (safer?) to use the service version
>>> instead of 'iptables --list'.
>>> I don't know if it will make any difference,
>>> but just in case, if the service is running,
>>> why not do 'service iptables stop',
>>> and perhaps also 'chkconfig iptables off' to be completely
>>> free of iptables?
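>>>
>>> For instance, as root, something like:
>>>
>>>    service iptables status
>>>    service iptables stop
>>>    chkconfig iptables off
>>>
>>> stops the firewall now and keeps it from starting at boot.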
>>>
>>> Gus Correa
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users