Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Tena Sakai (tsakai_at_[hidden])
Date: 2011-02-14 19:52:42


Hi Gus,

> Hence, I don't understand why the lack of symmetry in the
> firewall protection.
> Either vixen's is too loose, or dashen's is too tight, I'd risk to say.
> Maybe dashen was installed later, just got whatever boilerplate firewall
> that comes with RedHat, CentOS, Fedora.
> If there is a gateway for this LAN somewhere with another firewall,
> which is probably the case,

You are correct. We had a system administrator, but we lost
that person, and I installed dasher from scratch myself,
using the boilerplate firewall from the CentOS 5.5 distribution.

> Do you have Internet access from either machine?

Yes, I do.

> Vixen has yet another private IP 10.1.1.2 (eth0),
> with a bit weird combination of broadcast address 192.168.255.255(?),
> mask 255.0.0.0.
> vixen is/was part of another group of machines, via this other IP,
> cluster perhaps?

We have a Rocks HPC cluster. The cluster head is called blitzen
and there are 8 nodes in the cluster. We have completely outgrown
this setup. For example, I have been running an application for the
last 2 weeks on 4 of the 8 nodes, the other 4 nodes have been used
by my colleagues, and I expect my jobs to run another 2-3 weeks.
That is why I am interested in the cloud.

Vixen is not part of the Rocks cluster, but it is an NFS server
as well as a database server. Here's the ifconfig output from blitzen:

  [tsakai_at_blitzen Rmpi]$ ifconfig
  eth0 Link encap:Ethernet HWaddr 00:19:B9:E0:C0:0B
            inet addr:10.1.1.1 Bcast:10.255.255.255 Mask:255.0.0.0
            inet6 addr: fe80::219:b9ff:fee0:c00b/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
            RX packets:58859908 errors:0 dropped:0 overruns:0 frame:0
            TX packets:38795319 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:14637456238 (13.6 GiB) TX bytes:25487423161 (23.7 GiB)
            Interrupt:193 Memory:ec000000-ec012100
  
  eth1 Link encap:Ethernet HWaddr 00:19:B9:E0:C0:0D
            inet addr:172.16.1.106 Bcast:172.16.3.255 Mask:255.255.252.0
            inet6 addr: fe80::219:b9ff:fee0:c00d/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
            RX packets:99465693 errors:0 dropped:0 overruns:0 frame:0
            TX packets:46026372 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:44685802310 (41.6 GiB) TX bytes:28223858173 (26.2 GiB)
            Interrupt:193 Memory:ea000000-ea012100
  
  lo Link encap:Local Loopback
            inet addr:127.0.0.1 Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING MTU:16436 Metric:1
            RX packets:80078179 errors:0 dropped:0 overruns:0 frame:0
            TX packets:80078179 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:27450135463 (25.5 GiB) TX bytes:27450135463 (25.5 GiB)

And here's the same output from vixen:
[tsakai_at_vixen Rmpi]$ cat moo
  eth0 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:31
            inet addr:10.1.1.2 Bcast:192.168.255.255 Mask:255.0.0.0
            inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
            RX packets:61942079 errors:0 dropped:0 overruns:0 frame:0
            TX packets:61950934 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:47837093368 (44.5 GiB) TX bytes:54525223424 (50.7 GiB)
            Interrupt:185 Memory:ea000000-ea012100
  
  eth1 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:33
            inet addr:172.16.1.107 Bcast:172.16.3.255 Mask:255.255.252.0
            inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
            RX packets:5204606192 errors:0 dropped:0 overruns:0 frame:0
            TX packets:8935890067 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:371146631795 (345.6 GiB) TX bytes:13424275898600 (12.2 TiB)
            Interrupt:193 Memory:ec000000-ec012100
  
  lo Link encap:Local Loopback
            inet addr:127.0.0.1 Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING MTU:16436 Metric:1
            RX packets:244240818 errors:0 dropped:0 overruns:0 frame:0
            TX packets:244240818 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:1190988294201 (1.0 TiB) TX bytes:1190988294201 (1.0 TiB)

I think you are also correct as to:

> a bit weird combination of broadcast address 192.168.255.255 (?),
> and mask 255.0.0.0.

I think they are both misconfigured. I will fix them when I can.
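As a reference for the fix, the broadcast address implied by each address/mask pair can be checked with Python's stdlib ipaddress module (a quick sketch using the values from the ifconfig output above; nothing in this thread actually uses Python):

```python
import ipaddress

# vixen's eth0 reports Bcast:192.168.255.255, but for 10.1.1.2 with
# mask 255.0.0.0 the broadcast should be 10.255.255.255:
eth0 = ipaddress.IPv4Interface("10.1.1.2/255.0.0.0")
print(eth0.network.broadcast_address)  # 10.255.255.255

# vixen's eth1 (172.16.1.107, mask 255.255.252.0) is consistent:
eth1 = ipaddress.IPv4Interface("172.16.1.107/255.255.252.0")
print(eth1.network.broadcast_address)  # 172.16.3.255
```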

> What is in your ${TORQUE}/server_priv/nodes file?
> IPs or names (vixen & dashen).

We don't use TORQUE. We do use SGE from blitzen.

> Are they on a DNS server or do you resolve their names/IPs
> via /etc/hosts?
> Hopefully vixen's name resolves as 172.16.1.107.

They are on a DNS server:

  [tsakai_at_dasher Rmpi]$ nslookup vixen.egcrc.org
  Server: 172.16.1.2
  Address: 172.16.1.2#53

  Name: vixen.egcrc.org
  Address: 172.16.1.107

  [tsakai_at_dasher Rmpi]$ nslookup blitzen
  Server: 172.16.1.2
  Address: 172.16.1.2#53

  Name: blitzen.egcrc.org
  Address: 172.16.1.106

  [tsakai_at_dasher Rmpi]$
  [tsakai_at_dasher Rmpi]$
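As a sanity check, both resolved addresses should fall inside the shared 172.16.0.0/22 LAN that Gus identified (a small sketch with Python's stdlib ipaddress module, using the addresses returned by nslookup above):

```python
import ipaddress

# 255.255.252.0 is a /22 prefix, so the LAN spans 172.16.0.0 - 172.16.3.255
lan = ipaddress.ip_network("172.16.0.0/22")
for name, addr in [("vixen", "172.16.1.107"), ("blitzen", "172.16.1.106")]:
    print(name, ipaddress.ip_address(addr) in lan)  # both print True
```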

One more point that I overlooked in a previous post:

> I have yet to understand whether you copy your compiled tools
> (OpenMPI, R, etc) from your local machines to EC2,
> or if you build/compile them directly on the EC2 environment.

Tools like OpenMPI, R, and for that matter gcc, must be part
of the AMI. The AMI is stored on an Amazon device; it can be on
S3 (Simple Storage Service) or on a volume (which is what Ashley
recommends). So I put R and everything else I needed on the AMI
before I uploaded it to Amazon, except that I didn't put OpenMPI
on it. Instead, I used wget from my AMI instance to download the
OpenMPI source, compiled it on the instance, and saved that image
to S3. So now when I launch the instance, OpenMPI is part of
the AMI.

> Also, it's not clear to me if the OS in EC2 is an image
> from your local machines' OS/Linux distro, or independent of them,
> or if you can choose to have it either way.

The OS in EC2 is either Linux or Windows. (I have never
used Windows in my life.) For Linux, it can be any distribution
one chooses. In my case, I built an AMI from a CentOS
distribution with everything I needed. It is essentially
the same thing as dasher.

> On another posting, Ashley Pittman reported to
> be using OpenMPI in Amazon EC2 without problems,
> suggests pathway and gives several tips for that.
> That is probably a more promising path,
> which you may want to try.

I have a feeling that I will be in need of more help
from her.

Regards,

Tena

On 2/14/11 3:46 PM, "Gus Correa" <gus_at_[hidden]> wrote:

> Tena Sakai wrote:
>> Hi Kevin,
>>
>> Thanks for your reply.
>> Dasher is physically located under my desk and vixen is in a
>> secure data center.
>>
>>> does dasher have any network interfaces that vixen does not?
>>
>> No, I don't think so.
>> Here is more definitive info:
>> [tsakai_at_dasher Rmpi]$ ifconfig
>> eth0 Link encap:Ethernet HWaddr 00:1A:A0:E1:84:A9
>> inet addr:172.16.0.116 Bcast:172.16.3.255 Mask:255.255.252.0
>> inet6 addr: fe80::21a:a0ff:fee1:84a9/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:2347 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:1005 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:100
>> RX bytes:531809 (519.3 KiB) TX bytes:269872 (263.5 KiB)
>> Memory:c2200000-c2220000
>>
>> lo Link encap:Local Loopback
>> inet addr:127.0.0.1 Mask:255.0.0.0
>> inet6 addr: ::1/128 Scope:Host
>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>> RX packets:74 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:74 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:0
>> RX bytes:7824 (7.6 KiB) TX bytes:7824 (7.6 KiB)
>>
>> [tsakai_at_dasher Rmpi]$
>>
>> However, vixen has two ethernet interfaces:
>> [root_at_vixen ec2]# /sbin/ifconfig
>> eth0 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:31
>> inet addr:10.1.1.2 Bcast:192.168.255.255 Mask:255.0.0.0
>> inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:61913135 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:61923635 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:47832124690 (44.5 GiB) TX bytes:54515478860 (50.7 GiB)
>> Interrupt:185 Memory:ea000000-ea012100
>>
>> eth1 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:33
>> inet addr:172.16.1.107 Bcast:172.16.3.255 Mask:255.255.252.0
>> inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:5204431112 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:8935796075 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:371123590892 (345.6 GiB) TX bytes:13424246629869 (12.2
>> TiB)
>> Interrupt:193 Memory:ec000000-ec012100
>>
>> lo Link encap:Local Loopback
>> inet addr:127.0.0.1 Mask:255.0.0.0
>> inet6 addr: ::1/128 Scope:Host
>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>> RX packets:244169216 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:244169216 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:0
>> RX bytes:1190976360356 (1.0 TiB) TX bytes:1190976360356 (1.0
>> TiB)
>>
>> [root_at_vixen ec2]#
>>
>> Please see the mail posting that follows this, my reply to Ashley,
>> who nailed the problem precisely.
>>
>> Regards,
>>
>> Tena
>>
>>
>> On 2/14/11 1:35 PM, "Kevin.Buckley_at_[hidden]"
>> <Kevin.Buckley_at_[hidden]> wrote:
>>
>>> This probably shows my lack of understanding as to how OpenMPI
>>> negotiates the connectivity between nodes when given a choice
>>> of interfaces but anyway:
>>>
>>> does dasher have any network interfaces that vixen does not?
>>>
>>> The scenario I am imagining would be that you ssh into dasher
>>> from vixen using a "network" that both share and similarly, when
>>> you mpirun from vixen, the network that OpenMPI uses is constrained
>>> by the interfaces that can be seen from vixen, so you are fine.
>>>
>>> However when you are on dasher, mpirun sees another interface which
>>> it takes a liking to and so tries to use that, but that interface
>>> is not available to vixen so the OpenMPI processes spawned there
>>> terminate when they can't find that interface so as to talk back
>>> to dasher's controlling process.
>>>
>>> I know that you are no longer working with VMs but it's along those
>>> lines that I was thinking: extra network interfaces that you assume
>>> won't be used but which are and which could then be overcome by use
>>> of an explicit
>>>
>>> --mca btl_tcp_if_exclude virbr0
>>>
>>> or some such construction (virbr0 used as an example here).
>>>
>>> Kevin
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Hi Tena
>
>
> They seem to be connected through the LAN 172.16.0.0/255.255.252.0,
> with private IPs 172.16.0.116 (dashen,eth0) and
> 172.16.1.107 (vixen,eth1).
> These addresses are probably what OpenMPI is using.
> Not much like a cluster, but just machines in a LAN.
>
> Hence, I don't understand why the lack of symmetry in the
> firewall protection.
> Either vixen's is too loose, or dashen's is too tight, I'd risk to say.
> Maybe dashen was installed later, just got whatever boilerplate firewall
> that comes with RedHat, CentOS, Fedora.
> If there is a gateway for this LAN somewhere with another firewall,
> which is probably the case,
> I'd guess it is OK to turn off dashen's firewall.
>
> Do you have Internet access from either machine?
>
> Vixen has yet another private IP 10.1.1.2 (eth0),
> with a bit weird combination of broadcast address 192.168.255.255 (?),
> and mask 255.0.0.0.
> Maybe vixen is/was part of another group of machines, via this other IP,
> a cluster perhaps?
>
> What is in your ${TORQUE}/server_priv/nodes file?
> IPs or names (vixen & dashen).
>
> Are they on a DNS server or do you resolve their names/IPs
> via /etc/hosts?
>
> Hopefully vixen's name resolves as 172.16.1.107.
> (ping -R vixen may tell).
>
> Gus Correa
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users