
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-09-12 09:56:59


On Sep 11, 2008, at 6:29 PM, Prasanna Ranganathan wrote:

> I have tried the following to no avail.
>
> On 499 machines running openMPI 1.2.7:
>
> mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...
>
> With different combinations of the following parameters
>
> -mca btl_base_verbose 1 -mca btl_base_debug 2 -mca oob_base_verbose
> 1 -mca
> oob_tcp_debug 1 -mca oob_tcp_listen_mode listen_thread -mca
> btl_tcp_endpoint_cache 65536 -mca oob_tcp_peer_retries 120
>
> I still get the No route to Host error messages.

This is quite odd -- with the oob_tcp_listen_mode option, we have run
jobs with thousands of processes in the v1.2 series. The startup is
still a bit slow (it's vastly improved in the upcoming v1.3 series),
but we didn't run into problems like this.

Can you absolutely verify that you are running 1.2.7 on all of your
nodes and you have specified "-mca oob_tcp_listen_mode listen_thread"
on the mpirun command line? The important part here is that when you
invoke OMPI v1.2.7's mpirun on the head node, you are also using
v1.2.7 on all the back-end nodes as well.
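
One quick way to check is to compare the version string that ompi_info
reports on each node against the head node's. This is only a sketch: it
assumes passwordless ssh to every node and a plain one-hostname-per-line
hostfile named "nodelist", neither of which I can confirm from your setup.

```shell
#!/bin/sh
# Pull the version number out of a line like "  Open MPI: 1.2.7"
# (the format ompi_info prints near the top of its output).
extract_ver() {
  sed -n 's/^ *Open MPI: *\([0-9.]*\).*/\1/p'
}

# Demonstration of the extraction on a captured line:
echo "                Open MPI: 1.2.7" | extract_ver

# Across the cluster (uncomment to run for real; assumes passwordless
# ssh and that ompi_info is on the default PATH on every node):
# local_ver=$(ompi_info | extract_ver)
# while read -r host; do
#   remote_ver=$(ssh "$host" ompi_info | extract_ver)
#   [ "$remote_ver" = "$local_ver" ] || \
#     echo "MISMATCH on $host: $remote_ver (head node has $local_ver)"
# done < nodelist
```

Any MISMATCH line would mean a back-end node is picking up a different
Open MPI install than the one mpirun was launched from.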

> Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons
> and did
> not get any additional useful debug output other than the error
> messages.
>
> I did notice one strange thing though. The following is always
> successful
> (at least in all my attempts)
>
> mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
>
> but
>
> mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
> --debug-daemons
>
> prints these error messages at the end from each of the nodes :
>
> [idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [idx2:04064] [0,0,1] orted_recv_pls: received exit
> [idx2:04064] *** Process received signal ***
> [idx2:04064] Signal: Segmentation fault (11)
> [idx2:04064] Signal code: (128)
> [idx2:04064] Failing at address: (nil)
> [idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
> [idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close
> +0x18)
> [0x2b92cc0202a2]
> [idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize
> +0x70)
> [0x2b92cc00b5ac]
> [idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
> [0x2b92cc00875c]
> [idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
> [idx2:04064] *** End of error message ***
>
>
> I am not sure if this points to the actual cause for these issues.
> Is it to
> do with openMPI 1.2.7 having POSIX threads enabled in the current
> configuration
> on these nodes?

POSIX threads enabled should not cause these issues. What you want to
see in ompi_info output is the following:

[6:46] svbu-mpi:~/hg/openib-fd-progress % ompi_info | grep thread
           Thread support: posix (mpi: no, progress: no)

The two "no"s are what's important here.
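
If you want to script that check, here is a sketch that parses the
"Thread support" line shown above and verifies both flags read "no". The
sample line is copied from the output above; the variable names are mine,
and in practice you would substitute the output of `ompi_info | grep -i
thread` for it.

```shell
#!/bin/sh
# Sample "Thread support" line as printed by ompi_info (copied from above);
# replace with: line=$(ompi_info | grep -i "Thread support")
line='           Thread support: posix (mpi: no, progress: no)'

# Extract the two flags that matter: MPI thread support and progress threads.
mpi=$(echo "$line" | sed -n 's/.*mpi: *\([a-z]*\).*/\1/p')
progress=$(echo "$line" | sed -n 's/.*progress: *\([a-z]*\).*/\1/p')

if [ "$mpi" = "no" ] && [ "$progress" = "no" ]; then
  echo "OK: MPI and progress threads are both disabled"
else
  echo "WARNING: thread support is enabled (mpi=$mpi, progress=$progress)"
fi
```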

-- 
Jeff Squyres
Cisco Systems