Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error message related to infiniband
From: Gustavo Correa (gus_at_[hidden])
Date: 2014-01-19 23:27:44


Is your IB card in compute-01-10.private.dns.zone working?
Did you check it with ibstat?
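
For example (a quick sketch; exact field names can vary a bit across
OFED versions):

    $ ibstat
    # On the port Open MPI should use, look for:
    #   State: Active
    #   Physical state: LinkUp
    # "Down" or "Polling" means the link is not usable.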

Do you have a dual port IB card in compute-01-15.private.dns.zone?
Did you connect both ports to the same switch on the same subnet?
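
To see which subnet prefix each port is using, something like this
should work:

    $ ibv_devinfo -v | grep -i gid
    # GID[0] of each port starts with the subnet prefix;
    # the factory default is fe80:0000:0000:0000.

If both ports show the default prefix but are cabled to different
physical IB fabrics, each fabric's subnet manager needs a distinct
prefix. With OpenSM that is the subnet_prefix option; on many recent
OFED installs the config lives in /etc/opensm/opensm.conf rather than
the old /var/cache/opensm/opensm.opts path, which may be why you
could not find the latter.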

TCP "no route to host":
If it is not a firewall problem, could it perhaps be a bad Ethernet port on a node?
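
Note that errno 113 ("No route to host") is also what an iptables
REJECT rule with icmp-host-prohibited produces, so a node that answers
ping and ssh can still show it on other TCP ports. From the node that
reports the error, for example:

    $ ping 192.168.108.10              # basic reachability
    $ ip route get 192.168.108.10      # is there a route at all?
    $ iptables -L -n                   # any REJECT/DROP rules?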

Also, if you use host names in your hostfile, the nodes need to be able to
resolve those names into IP addresses.
Check that your /etc/hosts file, DNS server, or whatever you
use for name resolution is correct and consistent across the cluster.
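
For instance, run this on every node and compare the answers:

    $ getent hosts compute-01-10.private.dns.zone
    $ getent hosts 192.168.108.10
    # Forward and reverse lookups should agree cluster-wide.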
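
To separate the IB problem from the TCP one, you could also disable
openib and pin the TCP BTL to one interface, roughly like this
("eth0" is just a placeholder for whichever interface carries your
192.168.108.x addresses):

    $ mpirun -np 72 -hostfile hostlist \
          --mca btl tcp,self \
          --mca btl_tcp_if_include eth0 \
          ../bin/regcmMPI regcm.in

If that still fails with "no route to host", the problem is on the
Ethernet/firewall side, independent of InfiniBand.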

On Jan 19, 2014, at 10:18 PM, Syed Ahsan Ali wrote:

> I agree with you and am still struggling with the subnet ID settings, because I couldn't find the /var/cache/opensm/opensm.opts file.
>
> Secondly, if OMPI is falling back to TCP, then it should be able to reach the compute nodes, as they are available via ping and ssh.
>
>
> On Sun, Jan 19, 2014 at 9:38 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> If OMPI finds infiniband support on the node, it will attempt to use it. In this case, it would appear you have an incorrectly configured IB adaptor on the node, so you get the additional warning about that fact.
>
> OMPI then falls back to look for another transport, in this case TCP. However, the TCP transport is unable to create a socket to the remote host. The most likely cause is a firewall, so you might want to check that and turn it off.
>
>
> On Jan 19, 2014, at 4:19 AM, Syed Ahsan Ali <ahsanshah01_at_[hidden]> wrote:
>
>> Dear All
>>
>> I am getting InfiniBand errors while running applications with mpirun on the cluster. I get these errors even when I don't include any InfiniBand flags in the mpirun command. Please guide me.
>>
>> mpirun -np 72 -hostfile hostlist ../bin/regcmMPI regcm.in
>>
>> --------------------------------------------------------------------------
>> [[59183,1],24]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>> Module: OpenFabrics (openib)
>> Host: compute-01-10.private.dns.zone
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> WARNING: There are more than one active ports on host 'compute-01-15.private.dns.zone', but the
>> default subnet GID prefix was detected on more than one of these
>> ports. If these ports are connected to different physical IB
>> networks, this configuration will fail in Open MPI. This version of
>> Open MPI requires that every physically separate IB subnet that is
>> used between connected MPI processes must have different subnet ID
>> values.
>>
>> Please see this FAQ entry for more details:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
>>
>> NOTE: You can turn off this warning by setting the MCA parameter
>> btl_openib_warn_default_gid_prefix to 0.
>> --------------------------------------------------------------------------
>>
>> This is RegCM trunk
>> SVN Revision: tag 4.3.5.6 compiled at: data : Sep 3 2013 time: 05:10:53
>>
>> [pmd.pakmet.com:03309] 15 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
>> [pmd.pakmet.com:03309] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [pmd.pakmet.com:03309] 47 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
>> [compute-01-03.private.dns.zone][[59183,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],5][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],3][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],7][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>> [compute-01-03.private.dns.zone][[59183,1],4][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.108.10 failed: No route to host (113)
>>
>> Ahsan
>>
>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
>
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off +92518358714
> Cell # +923155145014