Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-09-15 16:54:51


Excellent!

We developers have talked about creating an FAQ entry for running at
large scale for a long time, but have never gotten a round tuit. I
finally filed a ticket to do this
(https://svn.open-mpi.org/trac/ompi/ticket/1503) -- these pending
documentation tickets will likely be handled as we get very close to
the v1.3 release.

On Sep 15, 2008, at 4:38 PM, Prasanna Ranganathan wrote:

> Hi,
>
> I am happy to state that I believe I have finally found the fix for
> the No route to host error!
>
> The solution was to increase the ARP cache size on the head node and
> also add a few static ARP entries. The cache was overflowing at some
> point during program execution, which disrupted connections and
> produced the error messages. I am still not sure, though, why the
> program did run successfully on certain occasions previously.
>
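[Editor's note: for readers hitting the same symptom, the Linux neighbor
(ARP) table limits can be raised with sysctl settings along these lines.
The threshold values and the IP/MAC address below are illustrative
assumptions, not the values Prasanna actually used:]

```shell
# Sketch: raise the Linux ARP/neighbor cache limits (values illustrative).
# gc_thresh1: entries below this count are never garbage-collected.
# gc_thresh2: soft limit; the kernel starts pruning above this.
# gc_thresh3: hard limit; "neighbour table overflow" errors appear beyond it.
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192

# Pin a static ARP entry for a compute node (hypothetical address/MAC),
# so it is never evicted from the cache:
arp -s 192.168.0.101 00:11:22:33:44:55
```

To make the thresholds persist across reboots, the same keys can be added
to /etc/sysctl.conf.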
> I want to thank everyone who helped me with this - particularly Eric
> and Jeff - for sharing their thoughts and for their time and effort.
> Thanks a lot, guys.
>
> On a side note, the other issue I noticed (my trivial helloWorld
> program with 1 process failing when run in debug mode) is something I
> have not yet resolved. It will take a bit longer since, as Eric
> mentioned, I need to upgrade the GCC version, fix the optimization
> flags, and update all the nodes. This is something I intend to follow
> up on and fix, but I'll be doing it a bit later. I'll update the
> mailing list once I make any progress on it.
>
> Again, thanks a lot guys for your invaluable help.
>
> Regards,
>
> Prasanna.
>
> On 9/15/08 11:08 AM, "users-request_at_[hidden]"
> <users-request_at_[hidden]> wrote:
>
>> Message: 1
>> Date: Mon, 15 Sep 2008 12:42:50 -0400
>> From: Eric Thibodeau <kyron_at_[hidden]>
>> Subject: Re: [OMPI users] Need help resolving No route to host error
>> with OpenMPI 1.1.2
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <48CE908A.9080902_at_[hidden]>
>> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>>
>> Simply to keep track of what's going on:
>>
>> I checked the build environment for Open MPI and the system's
>> settings: both were built using GCC 3.4.4 with -Os, a combination
>> reputed to be unstable and problematic with that compiler version.
>> I've asked Prasanna to rebuild using -O2, but this could take a
>> while since the entire system (or at least all the libraries Open
>> MPI links to) needs to be rebuilt.
>>
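[Editor's note: the /etc/make.conf references suggest a Gentoo system.
There, the -Os to -O2 switch and rebuild Eric describes would look
roughly like this; the flag values are assumptions and the era-appropriate
`emerge -e world` rebuild is lengthy, as he warns:]

```shell
# /etc/make.conf: replace -Os with -O2 in the global compiler flags
# (hypothetical example; keep your own -march and other flags as-is)
CFLAGS="-O2 -pipe"
CXXFLAGS="${CFLAGS}"

# Rebuild just Open MPI with the new flags:
emerge --oneshot sys-cluster/openmpi
# ...or rebuild the entire system so all linked libraries match (lengthy):
emerge -e world
```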
>> Eric
>>
>> Eric Thibodeau wrote:
>>> Prasanna,
>>>
>>> Please send me your /etc/make.conf and the contents of
>>> /var/db/pkg/sys-cluster/openmpi-1.2.7/
>>>
>>> You can package this with the following command line:
>>>
>>> tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/
>>>
>>> And simply send me the data.tbz file.
>>>
>>> Thanks,
>>>
>>> Eric
>>>
>>> Prasanna Ranganathan wrote:
>>>> Hi,
>>>>
>>>> I did make sure at the beginning that only eth0 was activated on
>>>> all the
>>>> nodes. Nevertheless, I am currently verifying the NIC
>>>> configuration on all
>>>> the nodes and making sure things are as expected.
>>>>
>>>> While trying different things, I did come across this peculiar
>>>> error which I
>>>> had detailed in one of my previous mails in this thread.
>>>>
>>>> I am testing the helloWorld program in the following trivial case:
>>>>
>>>> mpirun -np 1 -host localhost /main/mpiHelloWorld
>>>>
>>>> Which works fine.
>>>>
>>>> But,
>>>>
>>>> mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld
>>>>
>>>> always fails as follows:
>>>>
>>>> Daemon [0,0,1] checking in as pid 2059 on host localhost
>>>> [idx1:02059] [0,0,1] orted: received launch callback
>>>> idx1 is node 0 of 1
>>>> ranks sum to 0
>>>> [idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>> [idx1:02059] [0,0,1] orted_recv_pls: received exit
>>>> [idx1:02059] *** Process received signal ***
>>>> [idx1:02059] Signal: Segmentation fault (11)
>>>> [idx1:02059] Signal code: (128)
>>>> [idx1:02059] Failing at address: (nil)
>>>> [idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
>>>> [idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close
>>>> +0x18)
>>>> [0x2afa8be8e2a2]
>>>> [idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize
>>>> +0x70)
>>>> [0x2afa8be795ac]
>>>> [idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
>>>> [0x2afa8be7675c]
>>>> [idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
>>>> [idx1:02059] *** End of error message ***
>>>>
>>>> The failure happens with more verbose output when using the -d
>>>> flag.
>>>>
>>>> Does this point to some bug in OpenMPI or am I missing something
>>>> here?
>>>>
>>>> I have attached ompi_info output on this node.
>>>>
>>>> Regards,
>>>>
>>>> Prasanna.
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>

-- 
Jeff Squyres
Cisco Systems