Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)
From: Gus Correa (gus_at_[hidden])
Date: 2011-02-14 22:05:46


Hi Tena

Answers inline.

Tena Sakai wrote:
> Hi Gus,
>
>> Hence, I don't understand why the lack of symmetry in the
>> firewall protection.
>> Either vixen's is too loose, or dashen's is too tight, I'd risk to say.
>> Maybe dashen was installed later, just got whatever boilerplate firewall
>> that comes with RedHat, CentOS, Fedora.
>> If there is a gateway for this LAN somewhere with another firewall,
>> which is probably the case,
>
> You are correct. We had a system administrator, but we lost
> that person, and I installed dasher from scratch myself. I did
> use the boilerplate firewall from the CentOS 5.5 distribution.
>

I read your answers to Ashley and Reuti saying that you
turned the firewall off and OpenMPI now works between vixen and dashen.
That's good news!
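
In case it is useful later (on EC2 or elsewhere), on a stock CentOS 5.x
box the firewall is normally the iptables service, so something along
these lines, run as root, should show it and keep it off
(a sketch, assuming the standard service scripts):

   /sbin/service iptables status   # list the current firewall rules, if any
   /sbin/service iptables stop     # stop the firewall now
   /sbin/chkconfig iptables off    # keep it off across reboots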

>> Do you have Internet access from either machine?
>
> Yes, I do.

The LAN gateway is probably doing NAT.
I would guess it also has its own firewall.
Is there anybody there who could tell you about this?
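
If nobody is around to ask, the routing table will at least tell you
which gateway the machines use. Something like this, on either machine,
should print the default route (just a quick check, nothing is changed):

   /sbin/route -n    # the 0.0.0.0 destination line shows the default gateway
   netstat -rn       # same information from a different tool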

>
>> Vixen has yet another private IP 10.1.1.2 (eth0),
>> with a bit weird combination of broadcast address 192.168.255.255(?),
>> mask 255.0.0.0.
>> vixen is/was part of another group of machines, via this other IP,
>> cluster perhaps?
>
> We have a Rocks HPC cluster. The cluster head is called blitzen
> and there are 8 nodes in the cluster. We have completely outgrown
> this setting. For example, I have been running an application for the
> last 2 weeks on 4 of the 8 nodes, the other 4 have been used
> by my colleagues, and I expect my jobs to run another 2-3 weeks.
> That is why I am interested in the cloud.
>
> Vixen is not part of the Rocks cluster, but it is an NFS server
> as well as a database server. Here's the ifconfig of blitzen:
>
> [tsakai_at_blitzen Rmpi]$ ifconfig
> eth0 Link encap:Ethernet HWaddr 00:19:B9:E0:C0:0B
> inet addr:10.1.1.1 Bcast:10.255.255.255 Mask:255.0.0.0
> inet6 addr: fe80::219:b9ff:fee0:c00b/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:58859908 errors:0 dropped:0 overruns:0 frame:0
> TX packets:38795319 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:14637456238 (13.6 GiB) TX bytes:25487423161 (23.7 GiB)
> Interrupt:193 Memory:ec000000-ec012100
>
> eth1 Link encap:Ethernet HWaddr 00:19:B9:E0:C0:0D
> inet addr:172.16.1.106 Bcast:172.16.3.255 Mask:255.255.252.0
> inet6 addr: fe80::219:b9ff:fee0:c00d/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:99465693 errors:0 dropped:0 overruns:0 frame:0
> TX packets:46026372 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:44685802310 (41.6 GiB) TX bytes:28223858173 (26.2 GiB)
> Interrupt:193 Memory:ea000000-ea012100
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:80078179 errors:0 dropped:0 overruns:0 frame:0
> TX packets:80078179 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:27450135463 (25.5 GiB) TX bytes:27450135463 (25.5 GiB)
>
> And here's the same thing of vixen:
> [tsakai_at_vixen Rmpi]$ cat moo
> eth0 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:31
> inet addr:10.1.1.2 Bcast:192.168.255.255 Mask:255.0.0.0
> inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:61942079 errors:0 dropped:0 overruns:0 frame:0
> TX packets:61950934 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:47837093368 (44.5 GiB) TX bytes:54525223424 (50.7 GiB)
> Interrupt:185 Memory:ea000000-ea012100
>
> eth1 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:33
> inet addr:172.16.1.107 Bcast:172.16.3.255 Mask:255.255.252.0
> inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:5204606192 errors:0 dropped:0 overruns:0 frame:0
> TX packets:8935890067 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:371146631795 (345.6 GiB) TX bytes:13424275898600 (12.2 TiB)
> Interrupt:193 Memory:ec000000-ec012100
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:244240818 errors:0 dropped:0 overruns:0 frame:0
> TX packets:244240818 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:1190988294201 (1.0 TiB) TX bytes:1190988294201 (1.0 TiB)
>
> I think you are also correct as to:
>
>> a bit weird combination of broadcast address 192.168.255.255 (?),
>> and mask 255.0.0.0.
>
> I think they are both misconfigured. I will fix them when I can.
>

Blitzen's configuration looks like standard Rocks to me:
eth0 for private net, eth1 for LAN or WAN.
I think it is not misconfigured.

Also, beware that Rocks has its own ways/commands to configure things
(e.g., '$ rocks do this and that').
Using the Linux tools directly sometimes breaks things or leaves loose
ends on Rocks.
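
For example, assuming a fairly standard Rocks install (treat this as a
sketch, since I don't know your exact version), something like

   # rocks list host interface blitzen

run on the frontend should show how Rocks itself thinks the interfaces
are configured, which is safer than poking at the ifcfg files by hand.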

Vixen's eth0 looks weird, but now that you mention your Rocks cluster,
it may be that eth0 is used to connect vixen to the
cluster's private subnet and serve NFS to it.
Still, the Bcast address doesn't look right.
I would expect it to be 10.255.255.255 (as on blitzen's eth0) if vixen
serves NFS to the cluster via eth0.
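
If you do decide to fix it, on CentOS that setting usually lives in
/etc/sysconfig/network-scripts/ifcfg-eth0. A consistent 10.0.0.0/8
setup would look roughly like this (only a sketch; adjust to whatever
subnet the cluster side really uses):

   DEVICE=eth0
   IPADDR=10.1.1.2
   NETMASK=255.0.0.0
   BROADCAST=10.255.255.255
   ONBOOT=yes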

>> What is in your ${TORQUE}/server_priv/nodes file?
>> IPs or names (vixen & dashen).
>
> We don't use TORQUE. We do use SGE from blitzen.
>

Oh, sorry, you said before you don't use Torque.
I forgot that one.

What I really meant to ask is about your OpenMPI hostfile,
or how the --app file refers to the machines,
but I guess you use host names there, not IPs.
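
Just for reference, a minimal hostfile with names rather than IPs would
look something like this (the slot counts and program name are made up):

   # hostfile
   vixen.egcrc.org  slots=2
   dasher.egcrc.org slots=2

and then:

   mpirun --hostfile hostfile -np 4 ./my_mpi_program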

>> Are they on a DNS server or do you resolve their names/IPs
>> via /etc/hosts?
>> Hopefully vixen's name resolves as 172.16.1.107.
>
> They are on dns server:
>
> [tsakai_at_dasher Rmpi]$ nslookup vixen.egcrc.org
> Server: 172.16.1.2
> Address: 172.16.1.2#53
>
> Name: vixen.egcrc.org
> Address: 172.16.1.107
>
> [tsakai_at_dasher Rmpi]$ nslookup blitzen
> Server: 172.16.1.2
> Address: 172.16.1.2#53
>
> Name: blitzen.egcrc.org
> Address: 172.16.1.106
>
> [tsakai_at_dasher Rmpi]$
> [tsakai_at_dasher Rmpi]$
>

DNS makes it easier for you, especially on a LAN, where machines
change often in ways that you can't control.
You don't need to worry about resolving names with /etc/hosts,
which is the easy thing to do in a cluster.
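
For completeness, the /etc/hosts route would just mean lines like these
on every machine (addresses taken from your ifconfig output; the
dasher.egcrc.org name is my guess), but with a working DNS server you
can skip it:

   172.16.1.106   blitzen.egcrc.org  blitzen
   172.16.1.107   vixen.egcrc.org    vixen
   172.16.0.116   dasher.egcrc.org   dasher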

> One more point that I overlooked in a previous post:
>
>> I have yet to understand whether you copy your compiled tools
>> (OpenMPI, R, etc) from your local machines to EC2,
>> or if you build/compile them directly on the EC2 environment.
>
> Tools like OpenMPI, R, and for that matter gcc, must be part
> of the AMI. The AMI is stored on an Amazon device; it could be on
> S3 (Simple Storage Service) or a volume (which is what Ashley
> recommends). So I put R and everything I needed on the AMI
> before I uploaded it to Amazon. Only I didn't put OpenMPI
> on it. I did wget from my AMI instance to download the OpenMPI
> source, compiled it on the instance, and saved that image
> on S3. So now when I launch the instance, OpenMPI is part of
> the AMI.
>

It is clearer to me now.
It sounds right, although, other than the storage,
I can't fathom the difference between what you
did and what Ashley suggested.
Yet, somehow Ashley got it to work.
There may be something to pursue there.

>> Also, it's not clear to me if the OS in EC2 is an image
>> from your local machines' OS/Linux distro, or independent of them,
>> or if you can choose to have it either way.
>
> The OS in EC2 is either linux or windows. (I have never
> used windows in my life.)

I did.
Don't worry.
It is not a sin. :)

But seriously, from the problems I read about on the MPICH2 mailing list,
it seems to be hard to use for HPC and parallel programming, at least.

> For Linux, it can be any Linux
> one chooses. In my case, I built an AMI from a CentOS
> distribution with everything I needed. It is essentially
> the same thing as dasher.

Except for the firewall, I suppose.
Did you check if it is turned off on your EC2 replica of dasher?
I don't know if this question makes any sense in the EC2 context,
but maybe it does.

>
>> On another posting, Ashley Pittman reported to
>> be using OpenMPI in Amazon EC2 without problems,
>> suggests a pathway and gives several tips for that.
>> That is probably a more promising path,
>> which you may want to try.
>
> I have a feeling that I will be in need of more help
> from her.
>

Unless I am mistaken, I have the feeling that the
Ashley Pittman we've been talking to is a gentleman:

http://uk.linkedin.com/in/ashleypittman

not the jewelry designer:

http://www.ashleypittman.com/company-ashley-pittman.php

> Regards,
>
> Tena
>
>
>

Best,
Gus

> On 2/14/11 3:46 PM, "Gus Correa" <gus_at_[hidden]> wrote:
>
>> Tena Sakai wrote:
>>> Hi Kevin,
>>>
>>> Thanks for your reply.
>>> Dasher is physically located under my desk and vixen is in a
>>> secure data center.
>>>
>>>> does dasher have any network interfaces that vixen does not?
>>> No, I don't think so.
>>> Here is more definitive info:
>>> [tsakai_at_dasher Rmpi]$ ifconfig
>>> eth0 Link encap:Ethernet HWaddr 00:1A:A0:E1:84:A9
>>> inet addr:172.16.0.116 Bcast:172.16.3.255 Mask:255.255.252.0
>>> inet6 addr: fe80::21a:a0ff:fee1:84a9/64 Scope:Link
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:2347 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:1005 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:100
>>> RX bytes:531809 (519.3 KiB) TX bytes:269872 (263.5 KiB)
>>> Memory:c2200000-c2220000
>>>
>>> lo Link encap:Local Loopback
>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>> inet6 addr: ::1/128 Scope:Host
>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>> RX packets:74 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:74 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:0
>>> RX bytes:7824 (7.6 KiB) TX bytes:7824 (7.6 KiB)
>>>
>>> [tsakai_at_dasher Rmpi]$
>>>
>>> However, vixen has two ethernet interfaces:
>>> [root_at_vixen ec2]# /sbin/ifconfig
>>> eth0 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:31
>>> inet addr:10.1.1.2 Bcast:192.168.255.255 Mask:255.0.0.0
>>> inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:61913135 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:61923635 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:1000
>>> RX bytes:47832124690 (44.5 GiB) TX bytes:54515478860 (50.7 GiB)
>>> Interrupt:185 Memory:ea000000-ea012100
>>>
>>> eth1 Link encap:Ethernet HWaddr 00:1A:A0:1C:00:33
>>> inet addr:172.16.1.107 Bcast:172.16.3.255 Mask:255.255.252.0
>>> inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>> RX packets:5204431112 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:8935796075 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:1000
>>> RX bytes:371123590892 (345.6 GiB) TX bytes:13424246629869 (12.2 TiB)
>>> Interrupt:193 Memory:ec000000-ec012100
>>>
>>> lo Link encap:Local Loopback
>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>> inet6 addr: ::1/128 Scope:Host
>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>> RX packets:244169216 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:244169216 errors:0 dropped:0 overruns:0 carrier:0
>>> collisions:0 txqueuelen:0
>>> RX bytes:1190976360356 (1.0 TiB) TX bytes:1190976360356 (1.0 TiB)
>>>
>>> [root_at_vixen ec2]#
>>>
>>> Please see the mail posting that follows this, my reply to Ashley,
>>> who nailed the problem precisely.
>>>
>>> Regards,
>>>
>>> Tena
>>>
>>>
>>> On 2/14/11 1:35 PM, "Kevin.Buckley_at_[hidden]"
>>> <Kevin.Buckley_at_[hidden]> wrote:
>>>
>>>> This probably shows my lack of understanding as to how OpenMPI
>>>> negotiates the connectivity between nodes when given a choice
>>>> of interfaces but anyway:
>>>>
>>>> does dasher have any network interfaces that vixen does not?
>>>>
>>>> The scenario I am imagining would be that you ssh into dasher
>>>> from vixen using a "network" that both share and similarly, when
>>>> you mpirun from vixen, the network that OpenMPI uses is constrained
>>>> by the interfaces that can be seen from vixen, so you are fine.
>>>>
>>>> However when you are on dasher, mpirun sees another interface which
>>>> it takes a liking to and so tries to use that, but that interface
>>>> is not available to vixen so the OpenMPI processes spawned there
>>>> terminate when they can't find that interface so as to talk back
>>>> to dasher's controlling process.
>>>>
>>>> I know that you are no longer working with VMs but it's along those
>>>> lines that I was thinking: extra network interfaces that you assume
>>>> won't be used but which are and which could then be overcome by use
>>>> of an explicit
>>>>
>>>> --mca btl_tcp_if_exclude virbr0
>>>>
>>>> or some such construction (virbr0 used as an example here).
>>>>
>>>> Kevin
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Hi Tena
>>
>>
>> They seem to be connected through the LAN 172.16.0.0/255.255.252.0,
>> with private IPs 172.16.0.116 (dashen,eth0) and
>> 172.16.1.107 (vixen,eth1).
>> These addresses are probably what OpenMPI is using.
>> Not much like a cluster, but just machines in a LAN.
>>
>> Hence, I don't understand why the lack of symmetry in the
>> firewall protection.
>> Either vixen's is too loose, or dashen's is too tight, I'd risk to say.
>> Maybe dashen was installed later, just got whatever boilerplate firewall
>> that comes with RedHat, CentOS, Fedora.
>> If there is a gateway for this LAN somewhere with another firewall,
>> which is probably the case,
>> I'd guess it is OK to turn off dashen's firewall.
>>
>> Do you have Internet access from either machine?
>>
>> Vixen has yet another private IP 10.1.1.2 (eth0),
>> with a bit weird combination of broadcast address 192.168.255.255 (?),
>> and mask 255.0.0.0.
>> Maybe vixen is/was part of another group of machines, via this other IP,
>> a cluster perhaps?
>>
>> What is in your ${TORQUE}/server_priv/nodes file?
>> IPs or names (vixen & dashen).
>>
>> Are they on a DNS server or do you resolve their names/IPs
>> via /etc/hosts?
>>
>> Hopefully vixen's name resolves as 172.16.1.107.
>> (ping -R vixen may tell).
>>
>> Gus Correa
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users