Tena,
If I understand you correctly, the configuration you're trying to
use is
submission host [ec2 instance 0] <-> slave [ec2 instance 1]
I haven't tried this yet (although I will in the next few days).
I've tried
(a) submission host[non-ec2 system with static IP,
direct net connection] <-> slave [ec2 instance 1]
(b) submission host[non-ec2 system with local static IP,
connected to net via router] <-> slave [ec2 instance 1]
(a) works, (b) does not, presumably because Open MPI does not support
NAT (see Jeff Squyres' comments, later in the thread).
I notice that you're using the 'internal' URIs (the *.ec2.internal
names) to specify hostnames. This makes sense in principle, but have
you tried using the public/external URIs? Presumably Open MPI has to
look up these hostnames. I don't know how that's done, but trying to
look up the internal URIs might be a problem.
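One way to check the lookup question is to see what each name resolves to from the machine that runs mpirun. The following is only a rough sketch in Python: resolve_ipv4 is a hypothetical helper (not part of Open MPI), and it uses the system resolver, which may or may not be the same path Open MPI takes.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses that `hostname` resolves to.

    Uses the system resolver via socket.getaddrinfo. Returns an empty
    list if the name cannot be resolved from this machine.
    """
    try:
        infos = socket.getaddrinfo(
            hostname, None, socket.AF_INET, socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})
```

Calling resolve_ipv4 with an internal name such as ip-10-118-18-172.ec2.internal and then with the corresponding public name, on each machine involved, would show whether both forms resolve there and to which addresses.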
If you try this (or anything else), I'd appreciate it if you'd post
your results.
bw
On 2/17/11 4:08 AM, Tena Sakai wrote:
Re: [OMPI users] How are IP addresses determined?
Hi Barnet,
Allow me to interject.
Are you saying that you run the master on your local machine and
launch Open MPI processes on EC2? And that 1) the tcp port
tcp://192.168.1.101:35272 is on your local system and 2) the ec2
instance is trying to connect to your local machine’s port 35272
and hanging? Is that correct?
My situation is a bit different. I am running 2 ec2
instances and trying to run mpirun on both instances. My ssh
debug output looks quite similar to yours, and mpirun's behavior
is also very similar. Here’s what I captured:
Sending command: orted --daemonize -mca ess env -mca
orte_ess_jobid 1025769472 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 2 --hnp-uri
"1025769472.0;tcp://10.118.23.4:60941"
And here’s what I did on the instance from which I issued
mpirun:
[tsakai@ip-10-118-23-4 ~]$ nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: ip-10-118-23-4.ec2.internal
Address: 10.118.23.4
So that tcp port does belong to this instance. Furthermore,
the remote instance apparently cannot connect to it. No router
(which might perform address translation) is involved, and it
appears that the same thing you describe is happening.
Incidentally, here’s how I ran mpirun:
[tsakai@ip-10-118-23-4 ~]$ mpirun -app app.ac
With app.ac file:
[tsakai@ip-10-118-23-4 ~]$ cat app.ac
-H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
-H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
-H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
-H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
The first two lines spawn /bin/hostname on this instance
(ip-10-118-23-4.ec2.internal) and the bottom two lines spawn it on
the remote instance.
Here’s the security group used for these instances:
connection  protocol  from  to  source
----------  --------  ----  --  ---------
SSH         tcp       22    22  0.0.0.0/0
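One thing this table may be worth checking against is the port in the --hnp-uri above (60941), since only port 22 is listed. The sketch below is an illustration, not a confirmed diagnosis. It assumes that the group's rules also apply to traffic between the two instances, and that the remote instance's address is 10.118.18.172 (inferred from its hostname). RULES and is_allowed are hypothetical names, not an EC2 API.

```python
import ipaddress

# Rules transcribed from the security group table:
# (protocol, from_port, to_port, source CIDR).
RULES = [("tcp", 22, 22, "0.0.0.0/0")]


def is_allowed(protocol, port, source_ip, rules=RULES):
    """Return True if some rule admits this protocol/port/source."""
    src = ipaddress.ip_address(source_ip)
    for proto, low, high, cidr in rules:
        in_range = low <= port <= high
        if proto == protocol and in_range and src in ipaddress.ip_network(cidr):
            return True
    return False


# Assumed remote address, inferred from ip-10-118-18-172.ec2.internal.
print(is_allowed("tcp", 22, "10.118.18.172"))     # ssh: True
print(is_allowed("tcp", 60941, "10.118.18.172"))  # port from the hnp-uri: False
```

If that assumption holds, ssh would get through while the connection back to port 60941 would not, which would look much like the hang described here.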
Am I making sense?
Regards,
Tena
On 2/16/11 8:56 PM, "Barnet Wagman" <bw@norbl.com> wrote:
I've run into a problem involving accessing a remote host via a
router, and I think I need to understand how Open MPI determines IP
addresses. If there's anything posted on this subject, please point
me to it.
Here's the problem:
I've installed Open MPI (1.4.3) on a remote system (an Amazon
ec2 instance). If the local system I'm working on has a
static IP address (and a direct connection to the internet),
there's no problem. But if the local system accesses the
internet through a router (which itself gets its IP via
DHCP), the mpirun command hangs.
This is not a firewall problem - I've disabled the firewalls
on all the systems that are involved (and on the router).
It is also not an ssh problem. The ssh connection is being
made and it appears that the application has been launched
on the remote system. After the mpirun command has been
launched locally, a ps on the remote system shows a process
orted --daemonize -mca ess env
-mca orte_ess_jobid 1187643392 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 2 --hnp-uri
1187643392.0;tcp://192.168.1.101:35272
While I don't really understand the orted process, I assume
this indicates that a command to execute an app has been
received and that Open MPI is trying to run it.
I suspect that the problem is related to the '--hnp-uri ...
tcp://192.168.1.101' argument. 192.168.1.101 is the address
of my local system on my local network (attached to the
router), which of course is not accessible over the net. It
appears that Open MPI is transmitting the local (static) IP
address to the remote host.
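To make the point about the address concrete, here is a small Python sketch that pulls the contact address out of an --hnp-uri value and checks whether it lies in a private range. parse_hnp_uri is a hypothetical helper, not anything from Open MPI, and it only handles the single-address form shown in the ps output above.

```python
import ipaddress


def parse_hnp_uri(uri):
    """Split '<name>;tcp://<ip>:<port>' into (name, ip_address, port).

    Assumes exactly one tcp:// contact address, as in the ps output
    quoted in this message.
    """
    name, _, contact = uri.strip('"').partition(";")
    if not contact.startswith("tcp://"):
        raise ValueError("expected a tcp:// contact address: %r" % uri)
    host, _, port = contact[len("tcp://"):].rpartition(":")
    return name, ipaddress.ip_address(host), int(port)


name, addr, port = parse_hnp_uri("1187643392.0;tcp://192.168.1.101:35272")
# is_private is True for private ranges such as 192.168.0.0/16.
print(name, addr, port, addr.is_private)
# prints: 1187643392.0 192.168.1.101 35272 True
```

A private address here means the remote host is being told to connect back to something that is only meaningful on the local network behind the router.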
It would help to know how Open MPI determines and distributes
IP addresses, and whether there's any way to control this.
Any thoughts on dealing with this would be greatly
appreciated.
Thanks,
bw
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users