Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Prasanna Ranganathan (prasanna_at_[hidden])
Date: 2008-09-15 16:38:38


Hi,

I am happy to report that I believe I have finally found the fix for the No
route to host error!

The solution was to increase the ARP cache size on the head node and also add
a few static ARP entries. The cache was filling up at some point during
program execution, leading to the connection disruptions and the error
messages. I am still not sure, though, why the program did run successfully
on some occasions previously.
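
For anyone who runs into the same problem, the relevant Linux settings look
roughly like the following. This is only a sketch: the threshold values and
the IP/MAC pair below are placeholders, and the right thresholds depend on
the number of hosts in your cluster.

# Raise the kernel ARP cache thresholds (illustrative values):
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

# Pin a static ARP entry for a compute node (placeholder IP and MAC):
arp -s 192.168.1.10 00:11:22:33:44:55

Adding the gc_thresh lines to /etc/sysctl.conf makes the change persist
across reboots.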

I want to thank everyone who helped me with this - particularly Eric and
Jeff - for sharing their thoughts and also for their time and effort. Thanks
a lot guys.

On a side note, the other issue I noticed, the trivial run of my helloWorld
program with 1 process failing in debug mode, is something I have not yet
resolved. It will take a bit longer since, as Eric mentioned, I need to
upgrade the GCC version, fix the optimization flags, and update all the
nodes. I intend to follow up on this and fix it, but a bit later; I'll
update the mailing list once I make any progress on the same.
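
For reference, the rebuild Eric suggested would look roughly like this on
our Gentoo nodes. This is only a sketch; in practice the full set of
libraries that openmpi links against would need rebuilding as well.

# In /etc/make.conf, replace -Os with -O2, e.g.:
CFLAGS="-O2 -pipe"
CXXFLAGS="${CFLAGS}"

# Then rebuild Open MPI itself:
emerge --oneshot sys-cluster/openmpi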

Again, thanks a lot guys for your invaluable help.

Regards,

Prasanna.

On 9/15/08 11:08 AM, "users-request_at_[hidden]"
<users-request_at_[hidden]> wrote:

> Message: 1
> Date: Mon, 15 Sep 2008 12:42:50 -0400
> From: Eric Thibodeau <kyron_at_[hidden]>
> Subject: Re: [OMPI users] Need help resolving No route to host error
> with OpenMPI 1.1.2
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <48CE908A.9080902_at_[hidden]>
> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>
> Simply to keep track of what's going on:
>
> I checked the build environment for openmpi and the system's settings:
> they were built using gcc 3.4.4 with -Os, which is reputed to be unstable
> and problematic with this compiler version. I've asked Prasanna to rebuild
> using -O2, but this could be a bit lengthy since the entire system (or at
> least all the libs openmpi links to) needs to be rebuilt.
>
> Eric
>
> Eric Thibodeau wrote:
>> Prasanna,
>>
>> Please send me your /etc/make.conf and the contents of
>> /var/db/pkg/sys-cluster/openmpi-1.2.7/
>>
>> You can package this with the following command line:
>>
>> tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/
>>
>> And simply send me the data.tbz file.
>>
>> Thanks,
>>
>> Eric
>>
>> Prasanna Ranganathan wrote:
>>> Hi,
>>>
>>> I did make sure at the beginning that only eth0 was activated on all the
>>> nodes. Nevertheless, I am currently verifying the NIC configuration on all
>>> the nodes and making sure things are as expected.
>>>
>>> While trying different things, I came across the peculiar error below,
>>> which I had detailed in one of my previous mails in this thread.
>>>
>>> I am testing the helloWorld program in the following trivial case:
>>>
>>> mpirun -np 1 -host localhost /main/mpiHelloWorld
>>>
>>> Which works fine.
>>>
>>> But,
>>>
>>> mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld
>>>
>>> always fails as follows:
>>>
>>> Daemon [0,0,1] checking in as pid 2059 on host localhost
>>> [idx1:02059] [0,0,1] orted: received launch callback
>>> idx1 is node 0 of 1
>>> ranks sum to 0
>>> [idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [idx1:02059] [0,0,1] orted_recv_pls: received exit
>>> [idx1:02059] *** Process received signal ***
>>> [idx1:02059] Signal: Segmentation fault (11)
>>> [idx1:02059] Signal code: (128)
>>> [idx1:02059] Failing at address: (nil)
>>> [idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
>>> [idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18)
>>> [0x2afa8be8e2a2]
>>> [idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70)
>>> [0x2afa8be795ac]
>>> [idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
>>> [0x2afa8be7675c]
>>> [idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
>>> [idx1:02059] *** End of error message ***
>>>
>>> The same failure occurs, with more verbose output, when using the -d flag.
>>>
>>> Does this point to some bug in OpenMPI or am I missing something here?
>>>
>>> I have attached ompi_info output on this node.
>>>
>>> Regards,
>>>
>>> Prasanna.
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users