Tena,
Earlier today I got a
submission host [ec2 instance 0] <-> slave [ec2 instance 1]
configuration to work. I haven't fully digested your "this must be
an ssh ..." thread, but here are a few things that I found
necessary to do in order to get things working.
(i) First and foremost is the ec2 security group. The 'default'
group will probably not work. ompi randomly chooses ports. I think
that some ranges are excluded, but I was too lazy to find out, so I
just opened everything up, creating a group that includes the
line
Connection Method   Protocol   From port   To port   Source (IP or group)
All                 tcp        0           65535     0.0.0.0/0
Of course this could be insecure, depending on how your instance is
configured. Since I have no services running except ssh, I don't
foresee any problems.
(ii) Since you have ssh working, this is probably irrelevant: by
default, when ompi uses ssh, it attempts to log into the remote host
using the local user name, and will use the rsa file
$HOME/.ssh/id_rsa. However, you can explicitly set these by
specifying the ssh command in an MCA param, e.g.
OMPI_MCA_plm_rsh_agent="ssh -i rsa_file -l ec2-user"; export OMPI_MCA_plm_rsh_agent
The rsa file must have mode 600.
(iii) To suppress the ssh host authenticity check, I added
UserKnownHostsFile /dev/null
StrictHostKeyChecking no
to /etc/ssh/ssh_config
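
A minimal sketch of how (ii) and (iii) might be combined for a single user, without editing /etc/ssh/ssh_config: the two ssh options are passed on the agent command line with -o instead. The key path below is hypothetical, and ec2-user is taken from the example above; adjust both to match your instance. Disabling host key checking carries the same security caveat as the ssh_config change.

```shell
# Hypothetical key path; substitute the private key for your instance.
KEY_FILE="$HOME/.ssh/ec2_key.pem"

# ssh refuses private keys that other users can read, so restrict to mode 600.
if [ -f "$KEY_FILE" ]; then
    chmod 600 "$KEY_FILE"
fi

# Launch agent with an explicit key and login name, and with the
# host-key prompt disabled for the ssh sessions that mpirun starts.
OMPI_MCA_plm_rsh_agent="ssh -i $KEY_FILE -l ec2-user -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
export OMPI_MCA_plm_rsh_agent

# Show the agent command that mpirun will pick up from the environment.
echo "$OMPI_MCA_plm_rsh_agent"
```

Setting the variable in the shell that runs mpirun keeps the change local to that session, which is easier to undo than a system-wide ssh_config edit.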
Hope one of these helps.
bw
On 2/17/11 6:11 PM, Tena Sakai wrote:
Re: [OMPI users] How are IP addresses determined?
Hi Barnet,
> If I understand you correctly, the configuration you're trying to use is
> submission host[ec2 instance 0] <-> slave [ec2 instance 1]
Correct.
> but have you tried using the public/external uri?
I just did. It didn’t make a bit of difference.
I also tried IP addresses and that didn’t get me anywhere
either.
Jeff earlier gave me steps to follow, which I am about to
embark on.
May I suggest you follow the thread with the heading “This must be ssh
problem, but I can't figure out what it is...”
Regards,
Tena
On 2/17/11 10:05 AM, "Barnet Wagman" <bw@norbl.com>
wrote:
Tena,
If I understand you correctly, the configuration you're
trying to use is
submission host [ec2 instance 0] <-> slave [ec2 instance 1]
I haven't tried this yet (although I will in the next few days).
I've tried
(a) submission host [non-ec2 system with static IP, direct net connection] <-> slave [ec2 instance 1]
(b) submission host [non-ec2 system with local static IP, connected to net via router] <-> slave [ec2 instance 1]
(a) works, (b) does not, presumably because openmpi does not
support NAT (see Jeff Squyres' comments later in the thread).
I notice that you're using the 'internal' uri to specify
hostnames. This makes sense in principle, but have you tried
using the public/external uri? Presumably openmpi has to
look up these hostnames. I don't know how that's done, but
trying to look up the internal uri might be a problem.
If you try this (or anything else), I'd appreciate it if
you'd post your results.
bw
On 2/17/11 4:08 AM, Tena Sakai wrote:
Hi Barnet,
Allow me to interject.
Are you saying that you run the master on your local machine
and launch openMPI processes on EC2? You are saying that
1) tcp port tcp://192.168.1.101:35272 is on your local
system and 2) the ec2 instance is trying to connect to your
local machine’s port 35272, and hanging. Is that
correct?
My situation is a bit different. I am running 2
ec2 instances and trying to run mpirun on both instances.
My ssh debug output looks quite similar to yours and
mpirun's behavior is also very similar. Here’s what I
captured:
Sending command: orted --daemonize -mca ess env -mca
orte_ess_jobid 1025769472 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 2 --hnp-uri
"1025769472.0;tcp://10.118.23.4:60941"
And here’s what I did on the instance from which I issued
mpirun:
[tsakai@ip-10-118-23-4 ~]$ nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: ip-10-118-23-4.ec2.internal
Address: 10.118.23.4
So that tcp address does belong to this instance, yet it
seems the remote instance cannot connect to it. No router (which
might perform address translation) is involved, and it
appears that the same thing you describe is happening.
Incidentally, here’s how I ran mpirun:
[tsakai@ip-10-118-23-4 ~]$ mpirun -app app.ac
With app.ac file:
[tsakai@ip-10-118-23-4 ~]$ cat app.ac
-H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
-H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
-H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
-H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
The first two lines spawn /bin/hostname on this instance
(ip-10-118-23-4.ec2.internal) and the bottom 2 lines on
the remote instance.
Here’s the security group used for these instances:
connection   protocol   from   to   source
----------   --------   ----   --   ---------
SSH          tcp        22     22   0.0.0.0/0
Am I making sense?
Regards,
Tena
On 2/16/11 8:56 PM, "Barnet Wagman" <bw@norbl.com>
wrote:
I've run into a problem involving accessing a remote host
via a router, and I think I need to understand how openmpi
determines ip addresses. If there's anything posted on this
subject, please point me to it.
Here's the problem:
I've installed openmpi (1.4.3) on a remote system (an
Amazon ec2 instance). If the local system I'm working
on has a static ip address (and a direct connection to
the internet), there's no problem. But if the local
system accesses the internet through a router (which
itself gets its ip via dhcp), a call to mpirun
hangs.
This is not a firewall problem - I've disabled the
firewalls on all the systems that are involved (and the
router).
It is also not an ssh problem. The ssh connection is
being made and it appears that the application has been
launched on the remote system. After the mpirun command
has been launched locally, a ps on the remote system
shows a process
orted --daemonize -mca ess env -mca orte_ess_jobid 1187643392
-mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
--hnp-uri 1187643392.0;tcp://192.168.1.101:35272
While I don't really understand the orted process, I
assume this indicates that a command to execute an app
has been received and that openmpi is trying to run it.
I suspect that the problem is related to the
'--hnp-uri ... tcp://192.168.1.101' argument.
192.168.1.101 is the address of my local system on my
local network (attached to the router), which of course
is not accessible over the net. It appears that openmpi
is transmitting the local (static) ip address to the
remote host.
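
One way to check this suspicion mechanically is to pull the address out of the --hnp-uri value and test whether it falls in a private (RFC 1918) range. The sketch below is only a diagnostic aid: the function names hnp_uri_host and is_private_ipv4 are made up for illustration, and it assumes the uri has the single "jobid;tcp://address:port" form shown in the ps output above.

```shell
# Extract the host part of an --hnp-uri value such as
# "1187643392.0;tcp://192.168.1.101:35272".
hnp_uri_host() {
    uri=${1#*tcp://}             # drop everything up to and including "tcp://"
    printf '%s\n' "${uri%%:*}"   # drop ":port" and anything after it
}

# Succeed if the dotted-quad IPv4 address is in an RFC 1918 private range
# (10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16).
is_private_ipv4() {
    case $1 in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

host=$(hnp_uri_host '1187643392.0;tcp://192.168.1.101:35272')
if is_private_ipv4 "$host"; then
    echo "$host is a private address; an orted outside that network cannot connect back to it"
fi
```

A private address is not a problem in itself: two hosts on the same private network can reach each other. It only matters when the remote daemon sits on the other side of a router doing address translation.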
It would help to know how openmpi determines and
distributes IP addresses, and whether there's any way to
control this.
Any thoughts on dealing with this would be greatly
appreciated.
Thanks,
bw
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users