Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fwd: problem for multiple clusters using mpirun
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-21 13:05:48


Do you have any firewalling enabled on these machines? If so, you'll want to either disable it or allow TCP connections on arbitrary ports between all of the cluster nodes.
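
If you are not sure whether a firewall is in the way, one rough check is to try a plain TCP connection from one node to a port on the other node where you know something is listening. Here is a minimal Python sketch; it is not part of Open MPI, and the function name `can_connect`, the example port 50000, and the 3-second timeout are assumptions for illustration only. You would need to start a temporary listener on that port on the remote node first. A failure is consistent with a firewall (or simply a closed port), but does not prove either one.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical usage: python3 check_tcp.py karp 50000
    if len(sys.argv) != 3:
        sys.exit("usage: check_tcp.py HOST PORT")
    target_host, target_port = sys.argv[1], int(sys.argv[2])
    ok = can_connect(target_host, target_port)
    print(f"{target_host}:{target_port} reachable: {ok}")
```

Run it in both directions (for example from wirth to karp and from karp to wirth), since firewall rules are often not symmetric.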

On Mar 21, 2014, at 10:24 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]> wrote:

> /sbin/ifconfig
>
> hsaeed_at_karp:~$ /sbin/ifconfig
> br0 Link encap:Ethernet HWaddr 00:25:90:59:c9:ba
> inet addr:134.106.3.231 Bcast:134.106.3.255 Mask:255.255.255.0
> inet6 addr: fe80::225:90ff:fe59:c9ba/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:49080961 errors:0 dropped:50263 overruns:0 frame:0
> TX packets:43279252 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:41348407558 (38.5 GiB) TX bytes:80505842745 (74.9 GiB)
>
> br1 Link encap:Ethernet HWaddr 00:25:90:59:c9:bb
> inet addr:134.106.53.231 Bcast:134.106.53.255 Mask:255.255.255.0
> inet6 addr: fe80::225:90ff:fe59:c9bb/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:41573060 errors:0 dropped:50261 overruns:0 frame:0
> TX packets:1693509 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:6177072160 (5.7 GiB) TX bytes:230617435 (219.9 MiB)
>
> br2 Link encap:Ethernet HWaddr 00:c0:0a:ec:02:e7
> inet addr:10.231.2.231 Bcast:10.231.2.255 Mask:255.255.255.0
> UP BROADCAST MULTICAST MTU:1500 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> eth0 Link encap:Ethernet HWaddr 00:25:90:59:c9:ba
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:69108377 errors:0 dropped:0 overruns:0 frame:0
> TX packets:86459066 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:43533091399 (40.5 GiB) TX bytes:83359370885 (77.6 GiB)
> Memory:dfe60000-dfe80000
>
> eth1 Link encap:Ethernet HWaddr 00:25:90:59:c9:bb
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:43531546 errors:0 dropped:0 overruns:0 frame:0
> TX packets:1716151 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:7201915977 (6.7 GiB) TX bytes:232026383 (221.2 MiB)
> Memory:dfee0000-dff00000
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:10890707 errors:0 dropped:0 overruns:0 frame:0
> TX packets:10890707 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:36194379576 (33.7 GiB) TX bytes:36194379576 (33.7 GiB)
>
> tap0 Link encap:Ethernet HWaddr 00:c0:0a:ec:02:e7
> UP BROADCAST MULTICAST MTU:1500 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:500
> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>
> When I execute the following line
>
> hsaeed_at_karp:~/Task4_mpi/scatterv$ mpiexec -n 2 -host wirth,karp ./a.out
>
> I receive this error:
>
> [wirth][[59430,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 10.231.2.231 failed: Connection refused (111)
>
>
> NOTE: karp and wirth are two machines in an SSH cluster.
>
>
>
>
> On Fri, Mar 21, 2014 at 3:13 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
> On Mar 21, 2014, at 10:09 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]> wrote:
>
> > > I think I have a TCP connection. As far as I know, my cluster is not configured for InfiniBand (IB).
>
> Ok.
>
> > > But even with TCP connections:
> > >
> > > mpirun -n 2 -host master,node001 --mca btl tcp,sm,self ./helloworldmpi
> > > mpirun -n 2 -host master,node001 ./helloworldmpi
> > >
> > > These lines are not working; they output an error like:
> > > [btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to xx.xxx.x.xxx failed: Connection refused (111)
>
> What are the IP addresses reported by connect()? (i.e., the address you X'ed out)
>
> Send the output from ifconfig on each of your servers. Note that some Linux distributions do not put ifconfig in the default PATH of normal users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> _______________________________________________
> Hamid Saeed
> CoSynth GmbH & Co. KG
> Escherweg 2 - 26121 Oldenburg - Germany
> Tel +49 441 9722 738 | Fax -278
> http://www.cosynth.com
> _______________________________________________

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/