Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Eric Thibodeau (kyron_at_[hidden])
Date: 2008-09-11 12:04:37


Jeff Squyres wrote:
> I'm not sure what USE=-threads means, but I would discourage the use
> of threads in the v1.2 series; our thread support is pretty much
> broken in the 1.2 series.
That's exactly what it means, hence the following BFW (Big Fat Warning) I had
originally inserted in the ebuild to this effect:

        ewarn
        ewarn "WARNING: use of threads is still disabled by default in"
        ewarn "upstream builds."
        ewarn "You may stop now and set USE=-threads"
        ewarn
        epause 5

...OK, so maybe it's not that Big and Fat, but it's still there to be noticed
and logged ;)
>
>
> On Sep 10, 2008, at 7:52 PM, Eric Thibodeau wrote:
>
>> Prasanna, also make sure you try with USE=-threads ...as the ebuild
>> states, it's _experimental_ ;)
>>
>> Keep your eye on:
>> https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport
>>
>> Eric
>>
>> Prasanna Ranganathan wrote:
>>>
>>> Hi,
>>>
>>> I have upgraded my Open MPI to 1.2.6 (we run Gentoo, and emerge showed
>>> 1.2.6-r1 to be the latest stable version of Open MPI).
>>>
>>> I do still get the following error message when running my test
>>> helloWorld
>>> program:
>>>
>>> [10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>> [10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>> [10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
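For reference, errno 113 on Linux is EHOSTUNREACH ("No route to host"), the
error named in the subject line. A minimal C sketch to confirm the mapping on
a given node (assuming a Linux/glibc system; errno values are platform-specific):

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* On Linux, EHOSTUNREACH is errno 113; the numeric value and
             * message text can differ on other platforms. */
            printf("EHOSTUNREACH = %d: %s\n",
                   EHOSTUNREACH, strerror(EHOSTUNREACH));
            return 0;
        }

On a typical Linux compute node this prints "EHOSTUNREACH = 113: No route to
host", i.e. the kernel could not find a usable route to the peer's address at
connect() time.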
>>>
>>> Again, this error does not happen on every run of the test program;
>>> it occurs only some of the time.
>>>
>>> How do I take care of this?
>>>
>>> Regards,
>>>
>>> Prasanna.
>>>
>>>
>>> On 9/9/08 9:00 AM, "users-request_at_[hidden]"
>>> <users-request_at_[hidden]>
>>> wrote:
>>>
>>>
>>>> Message: 1
>>>> Date: Mon, 8 Sep 2008 16:43:33 -0400
>>>> From: Jeff Squyres <jsquyres_at_[hidden]>
>>>> Subject: Re: [OMPI users] Need help resolving No route to host error
>>>> with OpenMPI 1.1.2
>>>> To: Open MPI Users <users_at_[hidden]>
>>>> Message-ID: <AF302D68-0D30-469E-AFD3-566FF962814B_at_[hidden]>
>>>> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed;
>>>> delsp=yes
>>>>
>>>> Are you able to upgrade to Open MPI v1.2.7?
>>>>
>>>> There were *many* bug fixes and changes in the 1.2 series compared to
>>>> the 1.1 series; some, in particular, dealt with TCP socket
>>>> timeouts (which are important when dealing with large numbers of MPI
>>>> processes).
>>>>
>>>>
>>>>
>>>> On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to run a test mpiHelloWorld program that simply
>>>>> initializes the MPI environment on all the nodes, prints the
>>>>> hostname and rank of each node in the MPI process group and exits.
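A minimal sketch of a test program along these lines (the actual
mpiHelloWorld source is not included in the thread; this only illustrates the
standard MPI C calls such a program would use):

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char *argv[])
        {
            int rank, size, namelen;
            char hostname[MPI_MAX_PROCESSOR_NAME];

            /* Initialize MPI, report host name and rank, then exit. */
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            MPI_Get_processor_name(hostname, &namelen);

            printf("Hello from rank %d of %d on %s\n", rank, size, hostname);

            MPI_Finalize();
            return 0;
        }

Built with mpicc and launched with the mpirun command shown below, each rank
should print exactly one line.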
>>>>>
>>>>> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
>>>>> (each node has 2 dual-core CPUs).
>>>>>
>>>>> I get the following error messages when I run my program as follows:
>>>>> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
>>>>> .....
>>>>> .....
>>>>> .....
>>>>> [0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] [0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>>> [0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] [0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>>> connect() failed with errno=113
>>>>> connect() failed with errno=113connect() failed with errno=113connect() failed with errno=113[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>>> connect() failed with errno=113[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] [0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>>> connect() failed with errno=113
>>>>> [0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>>> [0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>>> connect() failed with errno=113
>>>>> .....
>>>>> .....
>>>>>
>>>>> The main thing is that I get these error messages on around 3-4 out of
>>>>> 10 attempts, with the rest all completing successfully. I have
>>>>> looked into the FAQs in detail and also checked the TCP BTL settings
>>>>> but am not able to figure it out.
>>>>>
>>>>> All the 499 nodes have only eth0 active and I get the error even
>>>>> when I run the following: mpirun -np 997 -bynode -hostfile nodelist
>>>>> --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
>>>>>
>>>>> I have attached the output of ompi_info --all.
>>>>>
>>>>> The following is the output of /sbin/ifconfig on the node where I
>>>>> start the mpi process (it is one of the 499 nodes)
>>>>>
>>>>> eth0  Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
>>>>>       inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
>>>>>       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>       RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
>>>>>       TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
>>>>>       collisions:0 txqueuelen:1000
>>>>>       RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
>>>>>       Interrupt:22 Base address:0xc000
>>>>>
>>>>> lo    Link encap:Local Loopback
>>>>>       inet addr:127.0.0.1  Mask:255.0.0.0
>>>>>       UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>>       RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
>>>>>       TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
>>>>>       collisions:0 txqueuelen:0
>>>>>       RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
>>>>>
>>>>>
>>>>> Kindly help.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Prasanna.
>>>>>
>>>>> <ompi_info.rtf>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>