Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fwd: problem for multiple clusters using mpirun
From: Hamid Saeed (e.hamidsaeed_at_[hidden])
Date: 2014-03-24 10:56:54


Hello,

I added "self", e.g.:

hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib \
    --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv

Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
--------------------------------------------------------------------------

ERROR::

At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[15751,1],7]) is on host: wirth
  Process 2 ([[15751,1],0]) is on host: karp
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[wirth:40329] *** An error occurred in MPI_Init
[wirth:40329] *** on a NULL communicator
[wirth:40329] *** Unknown error
[wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason: Before MPI_INIT completed
  Local host: wirth
  PID: 40329
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 7 with PID 40329 on
node wirth exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[karp:29513] 1 more process has sent help message help-mca-bml-r2.txt /
unreachable proc
[karp:29513] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
[karp:29513] 1 more process has sent help message help-mpi-runtime /
mpi_init:startup:pml-add-procs-fail
[karp:29513] 1 more process has sent help message help-mpi-errors.txt /
mpi_errors_are_fatal unknown handle
[karp:29513] 1 more process has sent help message help-mpi-runtime.txt /
ompi mpi abort:cannot guarantee all killed
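
The help text above names two diagnostics: ompi_info and the btl_base_verbose MCA parameter. A minimal sketch of how they might be invoked follows. The verbose_cmd helper is a hypothetical name used only to print the command instead of launching it; the process count, hosts, and binary are copied from the failing run, and the BTL list tcp,sm,self is the one used in an earlier command in this thread.

```shell
#!/bin/sh
# List the BTL components this Open MPI installation reports.
# Guarded so the script still runs on a machine without Open MPI.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i "MCA btl"
else
    echo "ompi_info not found in PATH"
fi

# Print (rather than run) a verbose re-run of the failing job.
# btl_base_verbose=100 asks the BTL framework to log which
# communication plugins were considered and/or discarded.
verbose_cmd() {
    echo "mpirun -np 8 --mca btl tcp,sm,self --mca btl_base_verbose 100 --host karp,wirth ./scatterv"
}
verbose_cmd
```

Copying the printed line into the shell on karp runs the job with verbose BTL selection output.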

I tried every combination of btl_tcp_if_include and btl_tcp_if_exclude.

I can't figure out what is wrong.
I can easily talk to the remote PC using netcat.
I am sure I am very close to the solution, but..

regards.

On Mon, Mar 24, 2014 at 3:25 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> If you use btl_tcp_if_exclude, you also need to exclude the loopback
> interface. Loopback is excluded by the default value of
> btl_tcp_if_exclude, but if you overwrite that value, then you need to
> *also* list the loopback interface in the new value.
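
The point about overriding the default can be sketched as a small helper that always keeps loopback in the exclude list. This is only an illustration: tcp_exclude_list is a hypothetical function name, "lo" is assumed to be the loopback interface name (as in the ifconfig output in this thread), and the command is printed as a dry run rather than executed.

```shell
#!/bin/sh
# Build a btl_tcp_if_exclude value that keeps loopback excluded.
# Arguments: the extra interface names to exclude (for example br2).
tcp_exclude_list() {
    list="lo"                          # loopback always stays in the list
    for iface in "$@"; do
        # Skip "lo" if the caller passed it, to avoid a duplicate entry.
        [ "$iface" = "lo" ] || list="$list,$iface"
    done
    echo "$list"
}

# Dry run: print the resulting command instead of launching it.
echo "mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_exclude $(tcp_exclude_list br2) --host karp,wirth ./a.out"
```

Only network interface names belong in this list; BTL component names such as sm and self go in the btl parameter instead.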
>
>
>
> On Mar 24, 2014, at 4:57 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]> wrote:
>
> > Hello,
> > I am still facing problems.
> > I checked, and there is no firewall acting as a barrier to the MPI
> > communication.
> >
> > I even tried a command line like
> > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 --mca btl_tcp_if_exclude
> > br2 -host wirth,karp ./a.out
> >
> > Now the run hangs without displaying any error.
> >
> > I used "..exclude br2" because the failed connection was on br2, as you
> > can see in the "ifconfig" output from my earlier message.
> >
> > I know there is something wrong, but I am unable to figure it out.
> >
> > I would appreciate some more suggestions.
> >
> > regards.
> >
> >
> > On Fri, Mar 21, 2014 at 6:05 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
> > Do you have any firewalling enabled on these machines? If so, you'll
> > want to either disable it, or allow random TCP connections between any of
> > the cluster nodes.
> >
> >
> > On Mar 21, 2014, at 10:24 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]> wrote:
> >
> > > /sbin/ifconfig
> > >
> > > hsaeed@karp:~$ /sbin/ifconfig
> > > br0       Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
> > >           inet addr:134.106.3.231  Bcast:134.106.3.255  Mask:255.255.255.0
> > >           inet6 addr: fe80::225:90ff:fe59:c9ba/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:49080961 errors:0 dropped:50263 overruns:0 frame:0
> > >           TX packets:43279252 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:41348407558 (38.5 GiB)  TX bytes:80505842745 (74.9 GiB)
> > >
> > > br1       Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
> > >           inet addr:134.106.53.231  Bcast:134.106.53.255  Mask:255.255.255.0
> > >           inet6 addr: fe80::225:90ff:fe59:c9bb/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:41573060 errors:0 dropped:50261 overruns:0 frame:0
> > >           TX packets:1693509 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:6177072160 (5.7 GiB)  TX bytes:230617435 (219.9 MiB)
> > >
> > > br2       Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
> > >           inet addr:10.231.2.231  Bcast:10.231.2.255  Mask:255.255.255.0
> > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
> > >
> > > eth0      Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:69108377 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:86459066 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:1000
> > >           RX bytes:43533091399 (40.5 GiB)  TX bytes:83359370885 (77.6 GiB)
> > >           Memory:dfe60000-dfe80000
> > >
> > > eth1      Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:43531546 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:1716151 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:1000
> > >           RX bytes:7201915977 (6.7 GiB)  TX bytes:232026383 (221.2 MiB)
> > >           Memory:dfee0000-dff00000
> > >
> > > lo        Link encap:Local Loopback
> > >           inet addr:127.0.0.1  Mask:255.0.0.0
> > >           inet6 addr: ::1/128 Scope:Host
> > >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> > >           RX packets:10890707 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:10890707 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:36194379576 (33.7 GiB)  TX bytes:36194379576 (33.7 GiB)
> > >
> > > tap0      Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
> > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:500
> > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
> > >
> > > When I execute the following line
> > >
> > > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 -host wirth,karp ./a.out
> > >
> > > I receive this error:
> > >
> > > [wirth][[59430,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> > > connect() to 10.231.2.231 failed: Connection refused (111)
> > >
> > >
> > > NOTE: karp and wirth are two machines in an ssh cluster.
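
The refused address 10.231.2.231 belongs to br2, which the listing shows as UP but not RUNNING, with zero packets. A rough sketch for picking out the interfaces that do look usable for TCP is below. It assumes the classic net-tools ifconfig layout shown in this thread ("inet addr:" lines and a flags line containing RUNNING); newer ifconfig versions print a different format and would need a different parser. usable_ipv4_ifaces is a hypothetical helper name.

```shell
#!/bin/sh
# Read classic net-tools ifconfig output on stdin and print
# "<interface> <IPv4 address>" for every non-loopback interface
# that has an IPv4 address and is flagged RUNNING.
usable_ipv4_ifaces() {
    awk '
        # A line starting in column 0 opens a new interface stanza.
        /^[^ \t]/    { iface = $1; addr = "" }
        # Remember the IPv4 address of the current stanza, if any.
        /inet addr:/ { split($0, a, "inet addr:"); split(a[2], b, " "); addr = b[1] }
        # The flags line comes after the address line in this format.
        /RUNNING/ && addr != "" && iface != "lo" { print iface, addr }
    '
}

# Usage (on each node): /sbin/ifconfig | usable_ipv4_ifaces
```

Applied to the karp listing above, this would leave br0 and br1 as candidates for btl_tcp_if_include, since br2 is not RUNNING and eth0, eth1, and tap0 carry no IPv4 address.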
> > >
> > >
> > >
> > >
> > > On Fri, Mar 21, 2014 at 3:13 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
> > > On Mar 21, 2014, at 10:09 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]> wrote:
> > >
> > > > > I think I have a TCP connection. As far as I know, my cluster is
> > > > > not configured for InfiniBand (IB).
> > >
> > > Ok.
> > >
> > > > > But even for TCP connections, these lines
> > > > >
> > > > > mpirun -n 2 -host master,node001 --mca btl tcp,sm,self ./helloworldmpi
> > > > > mpirun -n 2 -host master,node001 ./helloworldmpi
> > > > >
> > > > > are not working; they output an
> > > > > error like
> > > > > [btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> > > > > connect() to xx.xxx.x.xxx failed: Connection refused (111)
> > >
> > > What are the IP addresses reported by connect()? (i.e., the address
> > > you X'ed out)
> > >
> > > Send the output from ifconfig on each of your servers. Note that some
> > > Linux distributions do not put ifconfig in the default PATH of normal
> > > users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
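
For the PATH note above, a small sketch that locates ifconfig by checking PATH first and then the two directories mentioned. find_ifconfig is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Print the location of ifconfig: PATH first, then the usual sbin paths.
# Returns non-zero if no executable is found.
find_ifconfig() {
    if command -v ifconfig >/dev/null 2>&1; then
        command -v ifconfig
        return 0
    fi
    for candidate in /sbin/ifconfig /usr/sbin/ifconfig; do
        if [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if ifconfig_bin=$(find_ifconfig); then
    echo "ifconfig found at: $ifconfig_bin"
else
    echo "ifconfig not found" >&2
fi
```

Running the reported path on each server gives the output asked for here.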
> > >
> > > --
> > > Jeff Squyres
> > > jsquyres_at_[hidden]
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> > >
> > > --
> > > _______________________________________________
> > > Hamid Saeed
> > > CoSynth GmbH & Co. KG
> > > Escherweg 2 - 26121 Oldenburg - Germany
> > > Tel +49 441 9722 738 | Fax -278
> > > http://www.cosynth.com
> > > _______________________________________________
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
_______________________________________________
Hamid Saeed