Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Eric Thibodeau (kyron_at_[hidden])
Date: 2008-09-11 20:16:35


Prasanna,

    I opened up a bug report to enable a better control over the
threading options (http://bugs.gentoo.org/show_bug.cgi?id=237435). In
the meanwhile, if your helloWorld isn't too fluffy, could you send it
over (off list if you prefer) so I can take a look at it, the
Segmentation fault is probably hinting at another problem. Also, could
you send the output of ompi_info now that you've recompiled openmpi with
USE=-threads, I want to make sure the option went through as I hope it
should. Simply attach the file named out.txt after running the following
command:

ompi_info > out.txt

...RTF files tend to make my eyes cross over ;)

Thanks,

Eric

Prasanna Ranganathan wrote:
> Hi,
>
> I have tried the following to no avail.
>
> On 499 machines running openMPI 1.2.7:
>
> mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...
>
> With different combinations of the following parameters
>
> -mca btl_base_verbose 1 -mca btl_base_debug 2 -mca oob_base_verbose 1 -mca
> oob_tcp_debug 1 -mca oob_tcp_listen_mode listen_thread -mca
> btl_tcp_endpoint_cache 65536 -mca oob_tcp_peer_retries 120
>
> I still get the No route to Host error messages.
>
> Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and did
> not get any additional useful debug output other than the error messages.
>
> I did notice one strange thing though. The following is always successful
> (atleast all my attempts)
>
> mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
>
> but
>
> mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
> --debug-daemons
>
> prints these error messages at the end from each of the nodes :
>
> [idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [idx2:04064] [0,0,1] orted_recv_pls: received exit
> [idx2:04064] *** Process received signal ***
> [idx2:04064] Signal: Segmentation fault (11)
> [idx2:04064] Signal code: (128)
> [idx2:04064] Failing at address: (nil)
> [idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
> [idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18)
> [0x2b92cc0202a2]
> [idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70)
> [0x2b92cc00b5ac]
> [idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
> [0x2b92cc00875c]
> [idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
> [idx2:04064] *** End of error message ***
>
>
> I am not sure if this points to the actual cause for these issues. Is is to
> do with the openMPI 1.2.7 having posix enabled in the current configuration
> on these nodes?
>
> Thanks again for your continued help.
>
> Regards,
>
> Prasanna.
>
>
>> Message: 2
>> Date: Thu, 11 Sep 2008 12:16:50 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] Need help resolving No route to host error
>> with OpenMPI 1.1.2
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <7110E2D0-EB89-4293-A241-8487174B4788_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>
>> On Sep 10, 2008, at 9:29 PM, Prasanna Ranganathan wrote:
>>
>>
>>> I have upgraded to 1.2.7 and am still noticing the issue.
>>>
>> FWIW, we didn't change anything with regards to OOB and TCP from 1.2.6
>> -> 1.2.7, but it's still good to be at the latest version.
>>
>> Try running with this MCA parameter:
>>
>> mpirun --mca oob_tcp_listen_mode listen_thread ...
>>
>> Sorry; I forgot that we did not enable that option by default in the
>> v1.2 series.
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>