Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem for multiple clusters using mpirun
From: Hamid Saeed (e.hamidsaeed_at_[hidden])
Date: 2014-03-31 04:32:56


Yes Jeff,
You were right. The default value for btl_tcp_port_min_v4 is 1024.

I was facing a problem running my algorithm on multiple processors (using
ssh).

Answer:
The network administrator had locked those ports.
:(

I changed the communication port by forcing MPI to use another one:

mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include
br0 --mca btl_tcp_port_min_v4 10000 ./a.out
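
For future reference: the same setting can presumably also be made
persistent, so it does not have to be passed on every command line. A
minimal sketch, assuming a default Open MPI installation (the range value
is only an illustration):

  # $HOME/.openmpi/mca-params.conf -- read by Open MPI at startup
  btl_tcp_port_min_v4 = 10000
  # optional: how many ports above the minimum the TCP BTL may try
  btl_tcp_port_range_v4 = 1000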

Thanks again for the nice and effective suggestions.

Regards.

On Tue, Mar 25, 2014 at 1:27 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]
> wrote:

> This is very odd -- the default value for btl_tcp_port_min_v4 is 1024. So
> unless you have overridden this value, you should not be getting a port
> less than 1024. You can run this to see:
>
> ompi_info --level 9 --param btl tcp --parsable | grep port_min_v4
>
> Mine says this in a default 1.7.5 installation:
>
> mca:btl:tcp:param:btl_tcp_port_min_v4:value:1024
> mca:btl:tcp:param:btl_tcp_port_min_v4:source:default
> mca:btl:tcp:param:btl_tcp_port_min_v4:status:writeable
> mca:btl:tcp:param:btl_tcp_port_min_v4:level:2
> mca:btl:tcp:param:btl_tcp_port_min_v4:help:The minimum port where the TCP
> BTL will try to bind (default 1024)
> mca:btl:tcp:param:btl_tcp_port_min_v4:deprecated:no
> mca:btl:tcp:param:btl_tcp_port_min_v4:type:int
> mca:btl:tcp:param:btl_tcp_port_min_v4:disabled:false
>
>
>
> On Mar 25, 2014, at 5:36 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]> wrote:
>
> > Hello,
> > Thanks, I figured out the exact problem in my case.
> > Now I am using the following execution line; it directs the MPI
> > communication ports to start from 10000:
> >
> > mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca
> btl_tcp_if_include br0 --mca btl_tcp_port_min_v4 10000 ./a.out
> >
> > and everything works again.
> >
> > Thanks.
> >
> > Best regards.
> >
> >
> >
> >
> > On Tue, Mar 25, 2014 at 10:23 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > Hello,
> > I am not sure what approach the MPI communication follows, but when I
> > use
> > --mca btl_base_verbose 30
> >
> > I observe the port in question:
> >
> > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252
> on port 4
> >
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 134.106.3.252 failed: Connection refused (111)
> >
> >
> > The information at
> > http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> > is not enough; could you kindly explain?
> >
> > How can I restrict MPI communication to use ports starting from 1025,
> > or to use a specific port such as 59822?
> >
> > Regards.
> >
> >
> >
> > On Tue, Mar 25, 2014 at 9:15 AM, Reuti <reuti_at_[hidden]>
> wrote:
> > Hi,
> >
> > On 25.03.2014 at 08:34, Hamid Saeed wrote:
> >
> > > Is it possible to change the port number for the MPI communication?
> > >
> > > I can see that my program uses port 4 for the MPI communication.
> > >
> > > [karp:23756] btl: tcp: attempting to connect() to address
> 134.106.3.252 on port 4
> > >
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 134.106.3.252 failed: Connection refused (111)
> > >
> > > In my case the ports from 1 to 1024 are reserved.
> > > MPI tries to use one of the reserved ports and gets a "connection
> refused" error.
> > >
> > > I would be very glad for any kind suggestions.
> >
> > There are certain parameters to set the range of used ports, but using
> ports below 1024 should not be the default:
> >
> > http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> >
> > Were any of these perhaps set by accident beforehand in your environment?
> >
> > -- Reuti
> >
> >
> > > Regards.
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > > Hello Jeff,
> > >
> > > Thanks for your cooperation.
> > >
> > > --mca btl_tcp_if_include br0
> > >
> > > worked out of the box.
> > >
> > > The problem came from the network administrator's side. The machines on
> the network were blocking the MPI traffic...
> > >
> > > So cleaning up and killing everything worked.
> > >
> > > :)
> > >
> > > regards.
> > >
> > >
> > > On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > > There is no "self" IP interface in the Linux kernel.
> > >
> > > Try using btl_tcp_if_include and list just the interface(s) that you
> want to use. From your prior email, I'm *guessing* it's just br2 (i.e.,
> the 10.x address inside your cluster).
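> > >
> > > Something like this (illustrative, mirroring your earlier command line):
> > >
> > >   mpirun -np 8 --mca btl ^openib --mca btl_tcp_if_include br2 \
> > >       --host karp,wirth ./scatterv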
> > >
> > > Also, it looks like you didn't set up your SSH keys properly for
> logging in to remote nodes automatically.
> > >
> > >
> > >
> > > On Mar 24, 2014, at 10:56 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > >
> > > > Hello,
> > > >
> > > > I added the "self" BTL, e.g.:
> > > >
> > > > hsaeed_at_karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib
> --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth
> ./scatterv
> > > >
> > > > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> > > >
> --------------------------------------------------------------------------
> > > >
> > > > ERROR::
> > > >
> > > > At least one pair of MPI processes are unable to reach each other for
> > > > MPI communications. This means that no Open MPI device has indicated
> > > > that it can be used to communicate between these processes. This is
> > > > an error; Open MPI requires that all MPI processes be able to reach
> > > > each other. This error can sometimes be the result of forgetting to
> > > > specify the "self" BTL.
> > > >
> > > > Process 1 ([[15751,1],7]) is on host: wirth
> > > > Process 2 ([[15751,1],0]) is on host: karp
> > > > BTLs attempted: self sm
> > > >
> > > > Your MPI job is now going to abort; sorry.
> > > >
> --------------------------------------------------------------------------
> > > >
> --------------------------------------------------------------------------
> > > > MPI_INIT has failed because at least one MPI process is unreachable
> > > > from another. This *usually* means that an underlying communication
> > > > plugin -- such as a BTL or an MTL -- has either not loaded or not
> > > > allowed itself to be used. Your MPI job will now abort.
> > > >
> > > > You may wish to try to narrow down the problem;
> > > >
> > > > * Check the output of ompi_info to see which BTL/MTL plugins are
> > > > available.
> > > > * Run your application with MPI_THREAD_SINGLE.
> > > > * Set the MCA parameter btl_base_verbose to 100 (or
> mtl_base_verbose,
> > > > if using MTL-based communications) to see exactly which
> > > > communication plugins were considered and/or discarded.
> > > >
> --------------------------------------------------------------------------
> > > > [wirth:40329] *** An error occurred in MPI_Init
> > > > [wirth:40329] *** on a NULL communicator
> > > > [wirth:40329] *** Unknown error
> > > > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > > >
> --------------------------------------------------------------------------
> > > > An MPI process is aborting at a time when it cannot guarantee that
> all
> > > > of its peer processes in the job will be killed properly. You should
> > > > double check that everything has shut down cleanly.
> > > >
> > > > Reason: Before MPI_INIT completed
> > > > Local host: wirth
> > > > PID: 40329
> > > >
> --------------------------------------------------------------------------
> > > >
> --------------------------------------------------------------------------
> > > > mpirun has exited due to process rank 7 with PID 40329 on
> > > > node wirth exiting improperly. There are two reasons this could
> occur:
> > > >
> > > > 1. this process did not call "init" before exiting, but others in
> > > > the job did. This can cause a job to hang indefinitely while it waits
> > > > for all processes to call "init". By rule, if one process calls
> "init",
> > > > then ALL processes must call "init" prior to termination.
> > > >
> > > > 2. this process called "init", but exited without calling "finalize".
> > > > By rule, all processes that call "init" MUST call "finalize" prior to
> > > > exiting or it will be considered an "abnormal termination"
> > > >
> > > > This may have caused other processes in the application to be
> > > > terminated by signals sent by mpirun (as reported here).
> > > >
> --------------------------------------------------------------------------
> > > > [karp:29513] 1 more process has sent help message
> help-mca-bml-r2.txt / unreachable proc
> > > > [karp:29513] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> > > > [karp:29513] 1 more process has sent help message help-mpi-runtime /
> mpi_init:startup:pml-add-procs-fail
> > > > [karp:29513] 1 more process has sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> > > > [karp:29513] 1 more process has sent help message
> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
> > > >
> > > > I tried every combination of btl_tcp_if_include and btl_tcp_if_exclude...
> > > >
> > > > I can't figure out what is wrong.
> > > > I can easily talk with the remote PC using netcat.
> > > > I am sure I am very near the solution, but...
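> > > >
> > > > (The netcat check was something like the following -- the port number
> > > > is only an example:
> > > >
> > > >   on wirth:  nc -l 10000      # or "nc -l -p 10000", depending on the netcat variant
> > > >   on karp:   nc wirth 10000
> > > >
> > > > and text typed on one side shows up on the other.)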
> > > >
> > > > regards.
> > > >
> > > >
> > > >
> > > > On Mon, Mar 24, 2014 at 3:25 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > > > If you use btl_tcp_if_exclude, you also need to exclude the
> loopback interface. Loopback is excluded by the default value of
> btl_tcp_if_exclude, but if you overwrite that value, then you need to
> *also* include the loopback interface in the new value.
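> > > >
> > > > For example (illustrative -- note that "lo" stays in the list):
> > > >
> > > >   mpiexec -n 2 --mca btl_tcp_if_exclude lo,br2 -host wirth,karp ./a.out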
> > > >
> > > >
> > > >
> > > > On Mar 24, 2014, at 4:57 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > > >
> > > > > Hello,
> > > > > I am still facing problems.
> > > > > I checked that there is no firewall blocking the MPI communication.
> > > > >
> > > > > I even used an execution line like
> > > > > hsaeed_at_karp:~/Task4_mpi/scatterv$ mpiexec -n 2 --mca
> btl_tcp_if_exclude br2 -host wirth,karp ./a.out
> > > > >
> > > > > Now the program hangs without displaying any error.
> > > > >
> > > > > I used "...exclude br2" because the failed connection involved br2,
> as you can see in the "ifconfig" output I sent earlier.
> > > > >
> > > > > I know there is something wrong, but I am unable to figure it out.
> > > > >
> > > > > I need some more kind suggestions.
> > > > >
> > > > > regards.
> > > > >
> > > > >
> > > > > On Fri, Mar 21, 2014 at 6:05 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > > > > Do you have any firewalling enabled on these machines? If so,
> you'll want to either disable it, or allow random TCP connections between
> any of the cluster nodes.
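> > > > >
> > > > > (On a typical Linux box, running "iptables -L -n" as root will list
> > > > > the active rules -- assuming iptables is the firewall in use.)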
> > > > >
> > > > >
> > > > > On Mar 21, 2014, at 10:24 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]>
> wrote:
> > > > >
> > > > > > /sbin/ifconfig
> > > > > >
> > > > > > hsaeed_at_karp:~$ /sbin/ifconfig
> > > > > > br0 Link encap:Ethernet HWaddr 00:25:90:59:c9:ba
> > > > > > inet addr:134.106.3.231 Bcast:134.106.3.255
> Mask:255.255.255.0
> > > > > > inet6 addr: fe80::225:90ff:fe59:c9ba/64 Scope:Link
> > > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > > RX packets:49080961 errors:0 dropped:50263 overruns:0
> frame:0
> > > > > > TX packets:43279252 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > > collisions:0 txqueuelen:0
> > > > > > RX bytes:41348407558 (38.5 GiB) TX bytes:80505842745
> (74.9 GiB)
> > > > > >
> > > > > > br1 Link encap:Ethernet HWaddr 00:25:90:59:c9:bb
> > > > > > inet addr:134.106.53.231 Bcast:134.106.53.255
> Mask:255.255.255.0
> > > > > > inet6 addr: fe80::225:90ff:fe59:c9bb/64 Scope:Link
> > > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > > RX packets:41573060 errors:0 dropped:50261 overruns:0
> frame:0
> > > > > > TX packets:1693509 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > > collisions:0 txqueuelen:0
> > > > > > RX bytes:6177072160 (5.7 GiB) TX bytes:230617435
> (219.9 MiB)
> > > > > >
> > > > > > br2 Link encap:Ethernet HWaddr 00:c0:0a:ec:02:e7
> > > > > > inet addr:10.231.2.231 Bcast:10.231.2.255
> Mask:255.255.255.0
> > > > > > UP BROADCAST MULTICAST MTU:1500 Metric:1
> > > > > > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > > > > > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > collisions:0 txqueuelen:0
> > > > > > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> > > > > >
> > > > > > eth0 Link encap:Ethernet HWaddr 00:25:90:59:c9:ba
> > > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > > RX packets:69108377 errors:0 dropped:0 overruns:0
> frame:0
> > > > > > TX packets:86459066 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > > collisions:0 txqueuelen:1000
> > > > > > RX bytes:43533091399 (40.5 GiB) TX bytes:83359370885
> (77.6 GiB)
> > > > > > Memory:dfe60000-dfe80000
> > > > > >
> > > > > > eth1 Link encap:Ethernet HWaddr 00:25:90:59:c9:bb
> > > > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > > > > RX packets:43531546 errors:0 dropped:0 overruns:0
> frame:0
> > > > > > TX packets:1716151 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > > collisions:0 txqueuelen:1000
> > > > > > RX bytes:7201915977 (6.7 GiB) TX bytes:232026383
> (221.2 MiB)
> > > > > > Memory:dfee0000-dff00000
> > > > > >
> > > > > > lo Link encap:Local Loopback
> > > > > > inet addr:127.0.0.1 Mask:255.0.0.0
> > > > > > inet6 addr: ::1/128 Scope:Host
> > > > > > UP LOOPBACK RUNNING MTU:16436 Metric:1
> > > > > > RX packets:10890707 errors:0 dropped:0 overruns:0
> frame:0
> > > > > > TX packets:10890707 errors:0 dropped:0 overruns:0
> carrier:0
> > > > > > collisions:0 txqueuelen:0
> > > > > > RX bytes:36194379576 (33.7 GiB) TX bytes:36194379576
> (33.7 GiB)
> > > > > >
> > > > > > tap0 Link encap:Ethernet HWaddr 00:c0:0a:ec:02:e7
> > > > > > UP BROADCAST MULTICAST MTU:1500 Metric:1
> > > > > > RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > > > > > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > collisions:0 txqueuelen:500
> > > > > > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
> > > > > >
> > > > > > When I execute the following line
> > > > > >
> > > > > > hsaeed_at_karp:~/Task4_mpi/scatterv$ mpiexec -n 2 -host wirth,karp
> ./a.out
> > > > > >
> > > > > > I receive this error:
> > > > > >
> > > > > >
> [wirth][[59430,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 10.231.2.231 failed: Connection refused (111)
> > > > > >
> > > > > >
> > > > > > NOTE: karp and wirth are two machines in an ssh cluster.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 21, 2014 at 3:13 PM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
> > > > > > On Mar 21, 2014, at 10:09 AM, Hamid Saeed <
> e.hamidsaeed_at_[hidden]> wrote:
> > > > > >
> > > > > > > > I think I have a TCP connection. As far as I know, my cluster
> is not configured for InfiniBand (IB).
> > > > > >
> > > > > > Ok.
> > > > > >
> > > > > > > > But even for TCP connections:
> > > > > > > >
> > > > > > > > mpirun -n 2 -host master,node001 --mca btl tcp,sm,self
> ./helloworldmpi
> > > > > > > > mpirun -n 2 -host master,node001 ./helloworldmpi
> > > > > > > >
> > > > > > > > These lines do not work; they output an error like
> > > > > > > >
> [btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to
> xx.xxx.x.xxx failed: Connection refused (111)
> > > > > >
> > > > > > What are the IP addresses reported by connect()? (i.e., the
> address you X'ed out)
> > > > > >
> > > > > > Send the output from ifconfig on each of your servers. Note
> that some Linux distributions do not put ifconfig in the default PATH of
> normal users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
> > > > > >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>

-- 
_______________________________________________
Hamid Saeed
_______________________________________________