Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-09-11 11:14:35


I'm not sure what USE=-threads means, but I would discourage the use
of threads in the v1.2 series; our thread support there is pretty much
broken.
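
For reference, errno=113 in the connect() messages below is EHOSTUNREACH
("No route to host") on Linux, which is where the error in the subject
line comes from. A quick standalone check of that mapping (just a
sketch, independent of Open MPI):

    /* Print what errno 113 means on this system; on Linux it is
       EHOSTUNREACH, which strerror() reports as "No route to host". */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        printf("errno 113 -> %s (EHOSTUNREACH == %d)\n",
               strerror(113), EHOSTUNREACH);
        return 0;
    }

If that prints "No route to host" on your nodes, the failures below are
the kernel reporting that it had no route to the peer at that moment,
which fits the intermittent pattern described further down.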

On Sep 10, 2008, at 7:52 PM, Eric Thibodeau wrote:

> Prasanna, also make sure you try with USE=-threads ...as the ebuild
> states, it's _experimental_ ;)
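> (On Gentoo that amounts to disabling the "threads" USE flag for the
> openmpi package, e.g. a "sys-cluster/openmpi -threads" line in
> /etc/portage/package.use and then re-emerging openmpi; package name
> and path are the usual ones, adjust if your layout differs.)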
>
> Keep your eye on: https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport
>
> Eric
>
> Prasanna Ranganathan wrote:
>>
>> Hi,
>>
>> I have upgraded my Open MPI to 1.2.6 (we run Gentoo, and emerge showed
>> 1.2.6-r1 to be the latest stable version of Open MPI).
>>
>> I do still get the following error message when running my test
>> helloWorld
>> program:
>>
>> [10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>
>> Again, this error does not happen on every run of the test program;
>> it only shows up some of the time.
>>
>> How do I take care of this?
>>
>> Regards,
>>
>> Prasanna.
>>
>>
>> On 9/9/08 9:00 AM, "users-request_at_[hidden]" <users-request_at_[hidden]> wrote:
>>
>>
>>> Message: 1
>>> Date: Mon, 8 Sep 2008 16:43:33 -0400
>>> From: Jeff Squyres <jsquyres_at_[hidden]>
>>> Subject: Re: [OMPI users] Need help resolving No route to host error
>>> with OpenMPI 1.1.2
>>> To: Open MPI Users <users_at_[hidden]>
>>> Message-ID: <AF302D68-0D30-469E-AFD3-566FF962814B_at_[hidden]>
>>> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed;
>>> delsp=yes
>>>
>>> Are you able to upgrade to Open MPI v1.2.7?
>>>
>>> There were *many* bug fixes and changes in the 1.2 series compared
>>> to the 1.1 series; some of them, in particular, dealt with TCP socket
>>> timeouts (which are important when dealing with large numbers of MPI
>>> processes).
>>>
>>>
>>>
>>> On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> I am trying to run a test mpiHelloWorld program that simply
>>>> initializes the MPI environment on all the nodes, prints the
>>>> hostname and rank of each process, and exits.
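>>>>
>>>> It is essentially just the standard MPI hello-world skeleton,
>>>> something like:
>>>>
>>>>     #include <mpi.h>
>>>>     #include <stdio.h>
>>>>
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         char host[MPI_MAX_PROCESSOR_NAME];
>>>>         int rank, size, len;
>>>>
>>>>         MPI_Init(&argc, &argv);               /* bring up the MPI environment */
>>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process         */
>>>>         MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes    */
>>>>         MPI_Get_processor_name(host, &len);   /* hostname it is running on    */
>>>>
>>>>         printf("Hello from %s, rank %d of %d\n", host, rank, size);
>>>>
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }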
>>>>
>>>> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
>>>> (each node has two dual-core CPUs).
>>>>
>>>> I get the following error messages when I run my program as
>>>> follows:
>>>> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
>>>> .....
>>>> .....
>>>> .....
>>>> [0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> [0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> .....
>>>> .....
>>>>
>>>> The main thing is that I get these error messages in around 3-4 out
>>>> of 10 attempts, with the rest all completing successfully. I have
>>>> looked through the FAQs in detail and also checked the TCP BTL
>>>> settings, but I am not able to figure it out.
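>>>>
>>>> (By the TCP BTL settings I mean the parameters listed by
>>>> "ompi_info --param btl tcp".)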
>>>>
>>>> All the 499 nodes have only eth0 active and I get the error even
>>>> when I run the following: mpirun -np 997 -bynode -hostfile nodelist
>>>> --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
>>>>
>>>> I have attached the output of ompi_info --all.
>>>>
>>>> The following is the output of /sbin/ifconfig on the node where I
>>>> start the mpi process (it is one of the 499 nodes)
>>>>
>>>> eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
>>>>           inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
>>>>           TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
>>>>           Interrupt:22 Base address:0xc000
>>>>
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
>>>>
>>>>
>>>> Kindly help.
>>>>
>>>> Regards,
>>>>
>>>> Prasanna.
>>>>
>>>> <ompi_info.rtf>
>

-- 
Jeff Squyres
Cisco Systems