Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Eric Thibodeau (kyron_at_[hidden])
Date: 2008-09-15 12:42:50


Simply to keep track of what's going on:

I checked the build environment for openmpi and the system's setting,
they were built using gcc 3.4.4 with -Os, which was reputed unstable and
problematic with this compiler version. I've asked Prasanna to rebuild
using -O2 but this could be a bit lengthy since the entire system (or at
least all libs openmpi links to) needs to be rebuilt.

Eric

Eric Thibodeau wrote:
> Prasanna,
>
> Please send me your /etc/make.conf and the contents of
> /var/db/pkg/sys-cluster/openmpi-1.2.7/
>
> You can package this with the following command line:
>
> tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/
>
> And simply send me the data.tbz file.
>
> Thanks,
>
> Eric
>
> Prasanna Ranganathan wrote:
>> Hi,
>>
>> I did make sure at the beginning that only eth0 was activated on all the
>> nodes. Nevertheless, I am currently verifying the NIC configuration on all
>> the nodes and making sure things are as expected.
>>
>> While trying different things, I did come across this peculiar error which I
>> had detailed in one of my previous mails in this thread.
>>
>> I am testing the helloWorld program in the following trivial case:
>>
>> mpirun -np 1 -host localhost /main/mpiHelloWorld
>>
>> Which works fine.
>>
>> But,
>>
>> mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld
>>
>> always fails as follows:
>>
>> Daemon [0,0,1] checking in as pid 2059 on host localhost
>> [idx1:02059] [0,0,1] orted: received launch callback
>> idx1 is node 0 of 1
>> ranks sum to 0
>> [idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
>> [idx1:02059] [0,0,1] orted_recv_pls: received exit
>> [idx1:02059] *** Process received signal ***
>> [idx1:02059] Signal: Segmentation fault (11)
>> [idx1:02059] Signal code: (128)
>> [idx1:02059] Failing at address: (nil)
>> [idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
>> [idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18)
>> [0x2afa8be8e2a2]
>> [idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70)
>> [0x2afa8be795ac]
>> [idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
>> [0x2afa8be7675c]
>> [idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
>> [idx1:02059] *** End of error message ***
>>
>> The failure happens with more verbose output when using the -d flag.
>>
>> Does this point to some bug in OpenMPI or am I missing something here?
>>
>> I have attached ompi_info output on this node.
>>
>> Regards,
>>
>> Prasanna.
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users