
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error message related to infiniband
From: Syed Ahsan Ali (ahsanshah01_at_[hidden])
Date: 2014-01-20 05:11:05


My email was a mixture of error messages and warnings.

The IB card on compute-01-10 shows as faulty in ibstatus.

ibstat shows dual ports on compute-01-15 as well as on the other nodes;
I can see the status of both ports.
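
For reference, this is roughly how I check the ports on each node (a quick
sketch using the standard infiniband-diags tools; device and port numbers
vary per node):

  # One-screen summary of every IB port: state, phys state, default gid
  ibstatus
  # More detail per HCA and port
  ibstat | grep -E "CA '|Port |State:|Physical state:"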

The firewall is not a problem, I am sure about it. How can I check for a
bad Ethernet port? I can ping between the master and the compute nodes.
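
In case it helps, this is the kind of Ethernet check I can think of (a rough
sketch; eth0 is just a placeholder for the actual interface name):

  ethtool eth0 | grep 'Link detected'   # link up/down on the suspect port
  ip -s link show eth0                  # RX/TX error and drop counters
  ip route get 192.168.108.10           # route actually used to reach a node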

/etc/hosts is ok for name resolution.
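
To be sure it is consistent across the cluster, I can run something like
this from the master (a sketch; the node names follow my hostfile):

  for n in compute-01-03 compute-01-10 compute-01-15; do
      # every node should resolve the same name to the same address
      ssh $n getent hosts compute-01-10.private.dns.zone
  done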

Thank you very much for responding and helping me out.

Ahsan

On Mon, Jan 20, 2014 at 9:27 AM, Gustavo Correa <gus_at_[hidden]> wrote:

> Is your IB card in compute-01-10.private.dns.zone working?
> Did you check it with ibstat?
>
> Do you have a dual port IB card in compute-01-15.private.dns.zone?
> Did you connect both ports to the same switch on the same subnet?
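>
> (One way to see which subnet prefix each port reports, as a quick sketch
> with the standard infiniband-diags tools:
>
>   ibstatus | grep -E 'device|default gid'
>
> Ports whose "default gid" begins with the same fe80:... prefix are on the
> same IB subnet as far as Open MPI is concerned.)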
>
> TCP "no route to host":
> If it is not a firewall problem, could it be a bad Ethernet port on a node
> perhaps?
>
> Also, if you use host names in your hostfile, I guess the nodes need to be
> able to resolve those names into IP addresses.
> Check if your /etc/hosts file, DNS server, or whatever you
> use for name resolution, is correct and consistent across the cluster.
>
> On Jan 19, 2014, at 10:18 PM, Syed Ahsan Ali wrote:
>
> > I agree with you and am still struggling with the subnet ID settings because I
> couldn't find the /var/cache/opensm/opensm.opts file.
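> >
> > (If I understand the FAQ correctly, the option to change is subnet_prefix
> > in the OpenSM options file; on some installs the file is
> > /etc/opensm/opensm.conf rather than /var/cache/opensm/opensm.opts. A rough
> > sketch for the subnet manager of the second fabric:
> >
> >   # default is 0xfe80000000000000; each physical IB subnet needs its own
> >   subnet_prefix 0xfe80000000000001
> >
> > followed by an opensm restart.)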
> >
> > Secondly, if OMPI is falling back to TCP then it should be able to connect,
> as the compute nodes are reachable via ping and ssh.
> >
> >
> > On Sun, Jan 19, 2014 at 9:38 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > If OMPI finds infiniband support on the node, it will attempt to use it.
> In this case, it would appear you have an incorrectly configured IB adaptor
> on the node, so you get the additional warning about that fact.
> >
> > OMPI then falls back to look for another transport, in this case TCP.
> However, the TCP transport is unable to create a socket to the remote host.
> The most likely cause is a firewall, so you might want to check that and
> turn it off.
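> >
> > (A quick way to test, assuming RHEL/CentOS-style iptables on the nodes:
> >
> >   service iptables status     # or: iptables -L -n
> >   service iptables stop       # temporarily, on every node, to test
> >
> > Just a sketch; re-enable the firewall once the culprit is found.)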
> >
> >
> > On Jan 19, 2014, at 4:19 AM, Syed Ahsan Ali <ahsanshah01_at_[hidden]>
> wrote:
> >
> >> Dear All
> >>
> >> I am getting InfiniBand errors while running mpirun applications on the
> cluster. I get these errors even when I don't include InfiniBand usage
> flags in the mpirun command. Please guide.
> >>
> >> mpirun -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in
> >>
> >>
> --------------------------------------------------------------------------
> >> [[59183,1],24]: A high-performance Open MPI point-to-point messaging
> module
> >> was unable to find any relevant network interfaces:
> >> Module: OpenFabrics (openib)
> >> Host: compute-01-10.private.dns.zone
> >>
> >> Another transport will be used instead, although this may result in
> >> lower performance.
> >>
> --------------------------------------------------------------------------
> >>
> --------------------------------------------------------------------------
> >> WARNING: There are more than one active ports on host
> 'compute-01-15.private.dns.zone', but the
> >> default subnet GID prefix was detected on more than one of these
> >> ports. If these ports are connected to different physical IB
> >> networks, this configuration will fail in Open MPI. This version of
> >> Open MPI requires that every physically separate IB subnet that is
> >> used between connected MPI processes must have different subnet ID
> >> values.
> >>
> >> Please see this FAQ entry for more details:
> >>
> >>
> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
> >>
> >> NOTE: You can turn off this warning by setting the MCA parameter
> >> btl_openib_warn_default_gid_prefix to 0.
> >>
> --------------------------------------------------------------------------
> >>
> >> This is RegCM trunk
> >> SVN Revision: tag 4.3.5.6 compiled at: data : Sep 3 2013 time:
> 05:10:53
> >>
> >> [pmd.pakmet.com:03309] 15 more processes have sent help message
> help-mpi-btl-base.txt / btl:no-nics
> >> [pmd.pakmet.com:03309] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
> >> [pmd.pakmet.com:03309] 47 more processes have sent help message
> help-mpi-btl-openib.txt / default subnet prefix
> >>
> [compute-01-03.private.dns.zone][[59183,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> [compute-01-03.private.dns.zone][[59183,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> [compute-01-03.private.dns.zone][[59183,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> [compute-01-03.private.dns.zone][[59183,1],3][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> [compute-01-03.private.dns.zone][[59183,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> [compute-01-03.private.dns.zone][[59183,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> [compute-01-03.private.dns.zone][[59183,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> [compute-01-03.private.dns.zone][[59183,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.108.10 failed: No route to host (113)
> >>
> >> Ahsan
> >>
> >
> > --
> > Syed Ahsan Ali Bokhari
> > Electronic Engineer (EE)
> >
> > Research & Development Division
> > Pakistan Meteorological Department H-8/4, Islamabad.
> > Phone # off +92518358714
> > Cell # +923155145014