Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fwd: problem for multiple clusters using mpirun
From: Hamid Saeed (e.hamidsaeed_at_[hidden])
Date: 2014-03-25 05:23:33


Hello,
I am not sure what approach the MPI communication follows, but when I use
--mca btl_base_verbose 30

I observe the mentioned port.

[karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on
port 4
[karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
connect() to 134.106.3.252 failed: Connection refused (111)

The information at
http://www.open-mpi.org/community/lists/users/2011/11/17732.php
is not enough; could you kindly explain?

How can I restrict MPI communication to use ports starting from 1025,
or a specific port such as 59822?
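For reference, the thread linked above discusses the TCP BTL port-range MCA parameters. A minimal sketch, assuming an Open MPI version that exposes these parameters (verify with ompi_info before relying on them):

```shell
# Sketch: restrict Open MPI's TCP BTL to ports 1025-65535.
# Parameter names as discussed in the linked thread; confirm they exist
# in your installation with:  ompi_info --param btl tcp | grep port
mpirun -np 8 \
  --mca btl_tcp_port_min_v4 1025 \
  --mca btl_tcp_port_range_v4 64510 \
  ./a.out
```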

Regards.

On Tue, Mar 25, 2014 at 9:15 AM, Reuti <reuti_at_[hidden]> wrote:

> Hi,
>
> Am 25.03.2014 um 08:34 schrieb Hamid Saeed:
>
> > Is it possible to change the port number for the MPI communication?
> >
> > I can see that my program uses port 4 for the MPI communication.
> >
> > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252
> on port 4
> >
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 134.106.3.252 failed: Connection refused (111)
> >
> > In my case the ports from 1 to 1024 are reserved.
> > MPI tries to use one of the reserved ports and gets a connection
> > refused error.
> >
> > I will be very glad for any kind suggestions.
>
> There are certain parameters to set the range of used ports, but using any
> port up to 1024 should not be the default:
>
> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
>
> Were any of these set by accident in your environment beforehand?
>
> -- Reuti
>
>
> > Regards.
> >
> >
> >
> >
> >
> > On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > Hello Jeff,
> >
> > Thanks for your cooperation.
> >
> > --mca btl_tcp_if_include br0
> >
> > worked out of the box.
> >
> > The problem was on the network administrator's side: the machines on the
> > network were halting the MPI jobs,
> >
> > so cleaning up and killing everything worked.
> >
> > :)
> >
> > regards.
> >
> >
> > On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > There is no "self" IP interface in the Linux kernel.
> >
> > Try using btl_tcp_if_include and list just the interface(s) that you
> want to use. From your prior email, I'm *guessing* it's just br2 (i.e.,
> the 10.x address inside your cluster).
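In command form, that suggestion would look something like this (a sketch; br2 is only Jeff's guess from the ifconfig output quoted later in this thread, and as reported above it was br0 that eventually worked here):

```shell
# Sketch: list only the cluster-internal interface explicitly.
# Replace br2 with whichever interface actually routes between the nodes.
mpirun -np 8 --mca btl tcp,sm,self \
       --mca btl_tcp_if_include br2 \
       --host karp,wirth ./scatterv
```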
> >
> > Also, it looks like you didn't set up your SSH keys properly for logging
> > in to remote nodes automatically.
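The passphrase prompt shown later in this thread suggests the key was created with a passphrase. The usual passwordless setup is roughly the following (a sketch; the key path and host name are assumptions):

```shell
# Sketch: passwordless SSH from the launch node to each remote node.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # new key with empty passphrase
ssh-copy-id wirth                          # install the public key remotely
ssh wirth true                             # should return with no prompt
```

If the existing key must keep its passphrase, running ssh-agent and ssh-add before mpirun is the usual alternative.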
> >
> >
> >
> > On Mar 24, 2014, at 10:56 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> >
> > > Hello,
> > >
> > > I added "self", e.g.:
> > >
> > > hsaeed_at_karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib
> --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth
> ./scatterv
> > >
> > > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> > >
> --------------------------------------------------------------------------
> > >
> > > ERROR::
> > >
> > > At least one pair of MPI processes are unable to reach each other for
> > > MPI communications. This means that no Open MPI device has indicated
> > > that it can be used to communicate between these processes. This is
> > > an error; Open MPI requires that all MPI processes be able to reach
> > > each other. This error can sometimes be the result of forgetting to
> > > specify the "self" BTL.
> > >
> > > Process 1 ([[15751,1],7]) is on host: wirth
> > > Process 2 ([[15751,1],0]) is on host: karp
> > > BTLs attempted: self sm
> > >
> > > Your MPI job is now going to abort; sorry.
> > >
> --------------------------------------------------------------------------
> > >
> --------------------------------------------------------------------------
> > > MPI_INIT has failed because at least one MPI process is unreachable
> > > from another. This *usually* means that an underlying communication
> > > plugin -- such as a BTL or an MTL -- has either not loaded or not
> > > allowed itself to be used. Your MPI job will now abort.
> > >
> > > You may wish to try to narrow down the problem;
> > >
> > > * Check the output of ompi_info to see which BTL/MTL plugins are
> > > available.
> > > * Run your application with MPI_THREAD_SINGLE.
> > > * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> > > if using MTL-based communications) to see exactly which
> > > communication plugins were considered and/or discarded.
> > >
> --------------------------------------------------------------------------
> > > [wirth:40329] *** An error occurred in MPI_Init
> > > [wirth:40329] *** on a NULL communicator
> > > [wirth:40329] *** Unknown error
> > > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > >
> --------------------------------------------------------------------------
> > > An MPI process is aborting at a time when it cannot guarantee that all
> > > of its peer processes in the job will be killed properly. You should
> > > double check that everything has shut down cleanly.
> > >
> > > Reason: Before MPI_INIT completed
> > > Local host: wirth
> > > PID: 40329
> > >
> --------------------------------------------------------------------------
> > >
> --------------------------------------------------------------------------
> > > mpirun has exited due to process rank 7 with PID 40329 on
> > > node wirth exiting improperly. There are two reasons this could occur:
> > >
> > > 1. this process did not call "init" before exiting, but others in
> > > the job did. This can cause a job to hang indefinitely while it waits
> > > for all processes to call "init". By rule, if one process calls "init",
> > > then ALL processes must call "init" prior to termination.
> > >
> > > 2. this process called "init", but exited without calling "finalize".
> > > By rule, all processes that call "init" MUST call "finalize" prior to
> > > exiting or it will be considered an "abnormal termination"
> > >
> > > This may have caused other processes in the application to be
> > > terminated by signals sent by mpirun (as reported here).
> > >
> --------------------------------------------------------------------------
> > > [karp:29513] 1 more process has sent help message help-mca-bml-r2.txt
> / unreachable proc
> > > [karp:29513] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> > > [karp:29513] 1 more process has sent help message help-mpi-runtime /
> mpi_init:startup:pml-add-procs-fail
> > > [karp:29513] 1 more process has sent help message help-mpi-errors.txt
> / mpi_errors_are_fatal unknown handle
> > > [karp:29513] 1 more process has sent help message help-mpi-runtime.txt
> / ompi mpi abort:cannot guarantee all killed
> > >
> > > I tried every combination of btl_tcp_if_include and btl_tcp_if_exclude...
> > >
> > > I can't figure out what is wrong.
> > > I can easily talk to the remote PC using netcat.
> > > I am sure I am very near to the solution, but...
> > >
> > > regards.
> > >
> > >
> > >
> > > On Mon, Mar 24, 2014 at 3:25 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > > If you you use btl_tcp_if_exclude, you also need to exclude the
> loopback interface. Loopback is excluded by the default value of
> btl_tcp_if_exclude, but if you overwrite that value, then you need to
> *also* include the loopback interface in the new value.
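A corrected exclude list per that advice might look like this (a sketch):

```shell
# Sketch: when overriding btl_tcp_if_exclude, keep the loopback
# interface (lo) in the list along with the interface being excluded.
mpirun -n 2 --mca btl_tcp_if_exclude lo,br2 \
       -host wirth,karp ./a.out
```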
> > >
> > >
> > >
> > > On Mar 24, 2014, at 4:57 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > >
> > > > Hello,
> > > > Still I am facing problems.
> > > > I checked that there is no firewall acting as a barrier to the
> > > > MPI communication.
> > > >
> > > > I even used an execution line like
> > > > hsaeed_at_karp:~/Task4_mpi/scatterv$ mpiexec -n 2 --mca
> btl_tcp_if_exclude br2 -host wirth,karp ./a.out
> > > >
> > > > Now the program hangs without displaying any error.
> > > >
> > > > I used "...exclude br2" because the failed connection was to br2, as
> > > > you can see in the "ifconfig" output from my earlier message.
> > > >
> > > > I know there is something wrong, but I am unable to figure it out.
> > > >
> > > > I need some more kind suggestions.
> > > >
> > > > regards.
> > > >
> > > >
> > > > On Fri, Mar 21, 2014 at 6:05 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > > > Do you have any firewalling enabled on these machines? If so,
> you'll want to either disable it, or allow random TCP connections between
> any of the cluster nodes.
> > > >
> > > >
> > > > On Mar 21, 2014, at 10:24 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > > >
> > > > > /sbin/ifconfig
> > > > >
> > > > > hsaeed_at_karp:~$ /sbin/ifconfig
> > > > > br0 Link encap:Ethernet HWaddr 00:25:90:59:c9:ba
> > > > > inet addr:134.106.3.231 Bcast:134.106.3.255
> Mask:255.255.255.0
> > > > > inet6 addr: fe80::225:90ff:fe59:c9ba/64 Scope:Link
> > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > RX packets:49080961 errors:0 dropped:50263 overruns:0
> frame:0
> > > > > TX packets:43279252 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > collisions:0 txqueuelen:0
> > > > > RX bytes:41348407558 (38.5 GiB) TX bytes:80505842745
> (74.9 GiB)
> > > > >
> > > > > br1 Link encap:Ethernet HWaddr 00:25:90:59:c9:bb
> > > > > inet addr:134.106.53.231 Bcast:134.106.53.255
> Mask:255.255.255.0
> > > > > inet6 addr: fe80::225:90ff:fe59:c9bb/64 Scope:Link
> > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > RX packets:41573060 errors:0 dropped:50261 overruns:0
> frame:0
> > > > > TX packets:1693509 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > collisions:0 txqueuelen:0
> > > > > RX bytes:6177072160 (5.7 GiB) TX bytes:230617435 (219.9
> MiB)
> > > > >
> > > > > br2 Link encap:Ethernet HWaddr 00:c0:0a:ec:02:e7
> > > > > inet addr:10.231.2.231 Bcast:10.231.2.255
> Mask:255.255.255.0
> > > > > UP BROADCAST MULTICAST MTU:1500 Metric:1
> > > > > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > > > > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > > > > collisions:0 txqueuelen:0
> > > > > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> > > > >
> > > > > eth0 Link encap:Ethernet HWaddr 00:25:90:59:c9:ba
> > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > RX packets:69108377 errors:0 dropped:0 overruns:0 frame:0
> > > > > TX packets:86459066 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > collisions:0 txqueuelen:1000
> > > > > RX bytes:43533091399 (40.5 GiB) TX bytes:83359370885
> (77.6 GiB)
> > > > > Memory:dfe60000-dfe80000
> > > > >
> > > > > eth1 Link encap:Ethernet HWaddr 00:25:90:59:c9:bb
> > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > RX packets:43531546 errors:0 dropped:0 overruns:0 frame:0
> > > > > TX packets:1716151 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > collisions:0 txqueuelen:1000
> > > > > RX bytes:7201915977 (6.7 GiB) TX bytes:232026383 (221.2
> MiB)
> > > > > Memory:dfee0000-dff00000
> > > > >
> > > > > lo Link encap:Local Loopback
> > > > > inet addr:127.0.0.1 Mask:255.0.0.0
> > > > > inet6 addr: ::1/128 Scope:Host
> > > > > UP LOOPBACK RUNNING MTU:16436 Metric:1
> > > > > RX packets:10890707 errors:0 dropped:0 overruns:0 frame:0
> > > > > TX packets:10890707 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > collisions:0 txqueuelen:0
> > > > > RX bytes:36194379576 (33.7 GiB) TX bytes:36194379576
> (33.7 GiB)
> > > > >
> > > > > tap0 Link encap:Ethernet HWaddr 00:c0:0a:ec:02:e7
> > > > > UP BROADCAST MULTICAST MTU:1500 Metric:1
> > > > > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > > > > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > > > > collisions:0 txqueuelen:500
> > > > > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> > > > >
> > > > > When i execute the following line
> > > > >
> > > > > hsaeed_at_karp:~/Task4_mpi/scatterv$ mpiexec -n 2 -host wirth,karp
> ./a.out
> > > > >
> > > > > i receive Error
> > > > >
> > > > >
> [wirth][[59430,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 10.231.2.231 failed: Connection refused (111)
> > > > >
> > > > >
> > > > > NOTE: Karp and wirth are two machines on ssh cluster.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Mar 21, 2014 at 3:13 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > > > > On Mar 21, 2014, at 10:09 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > > > >
> > > > > > I think I have a TCP connection. As far as I know, my cluster
> is not configured for InfiniBand (IB).
> > > > >
> > > > > Ok.
> > > > >
> > > > > > > But even for TCP connections:
> > > > > > >
> > > > > > > mpirun -n 2 -host master,node001 --mca btl tcp,sm,self
> ./helloworldmpi
> > > > > > > mpirun -n 2 -host master,node001 ./helloworldmpi
> > > > > > >
> > > > > > > These lines do not work; they output an error like
> > > > > > > [btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to xx.xxx.x.xxx failed: Connection refused (111)
> > > > >
> > > > > What are the IP addresses reported by connect()? (i.e., the
> address you X'ed out)
> > > > >
> > > > > Send the output from ifconfig on each of your servers. Note that
> > > > > some Linux distributions do not put ifconfig in the default PATH of
> > > > > normal users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
> > > > >
> > > > > --
> > > > > Jeff Squyres
> > > > > jsquyres_at_[hidden]
> > > > > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> > > > >
> > > > > _______________________________________________
> > > > > users mailing list
> > > > > users_at_[hidden]
> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > _______________________________________________
> > > > > Hamid Saeed
> > > > > CoSynth GmbH & Co. KG
> > > > > Escherweg 2 - 26121 Oldenburg - Germany
> > > > > Tel +49 441 9722 738 | Fax -278
> > > > > http://www.cosynth.com
> > > > > _______________________________________________
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>

-- 
_______________________________________________
Hamid Saeed
CoSynth GmbH & Co. KG
Escherweg 2 - 26121 Oldenburg - Germany
Tel +49 441 9722 738 | Fax -278
http://www.cosynth.com
_______________________________________________