Prasanna,

    I opened a bug report to allow better control over the threading options (http://bugs.gentoo.org/show_bug.cgi?id=237435). In the meantime, if your helloWorld isn't too fluffy, could you send it over (off list if you prefer) so I can take a look at it? The segmentation fault is probably hinting at another problem. Also, could you send the output of ompi_info now that you've recompiled openmpi with USE=-threads? I want to make sure the option went through as I hope it did. Simply attach the file named out.txt after running the following command:

ompi_info > out.txt
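
For a quick look before sending, a sketch like the one below could pull the threading line out of the saved file. The helper name thread_support is made up for illustration, and it assumes the dump contains a line mentioning "Thread support", which is how ompi_info normally reports it.

```shell
# Hypothetical helper: print the thread-support line(s) from a saved ompi_info dump.
# Assumes the dump has a line containing "Thread support" (matched case-insensitively).
thread_support() {
    grep -i 'thread support' "$1"
}
```

Running `thread_support out.txt` should then show whether the rebuilt package still reports posix threads.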

...RTF files tend to make my eyes cross over ;)

Thanks,

Eric

Prasanna Ranganathan wrote:
Hi,

I have tried the following to no avail.

On 499 machines running openMPI 1.2.7:

mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...

With different combinations of the following parameters

-mca btl_base_verbose 1 -mca btl_base_debug 2 -mca oob_base_verbose 1 -mca
oob_tcp_debug 1 -mca oob_tcp_listen_mode listen_thread -mca
btl_tcp_endpoint_cache 65536 -mca oob_tcp_peer_retries 120

I still get the No route to Host error messages.
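
One way to narrow this down might be to check which entries in the hostfile are reachable at all from the launching machine. The following is only a sketch: both helper names are hypothetical, it assumes the host name is the first field on each hostfile line, and it assumes Linux iputils ping. It tests ICMP from one machine only, so a firewall that blocks TCP between two compute nodes would not show up here.

```shell
# Hypothetical helper: list the host names in an Open MPI style hostfile.
# Takes the first field of every line that is neither blank nor a comment.
hostfile_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Hypothetical helper: print each host that does not answer one ping within 2 seconds.
# Assumes Linux iputils ping (-c count, -W timeout in seconds).
unreachable_hosts() {
    hostfile_hosts "$1" | while read -r host; do
        ping -c 1 -W 2 "$host" >/dev/null 2>&1 || echo "$host"
    done
}
```

For example, `unreachable_hosts nodelist` prints nothing if every node answered.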

Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and did
not get any additional useful debug output other than the error messages.

I did notice one strange thing though. The following is always successful
(at least in all my attempts)

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld

but

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld --debug-daemons

prints these error messages at the end from each of the nodes:

[idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx2:04064] [0,0,1] orted_recv_pls: received exit
[idx2:04064] *** Process received signal ***
[idx2:04064] Signal: Segmentation fault (11)
[idx2:04064] Signal code:  (128)
[idx2:04064] Failing at address: (nil)
[idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
[idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2b92cc0202a2]
[idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2b92cc00b5ac]
[idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2b92cc00875c]
[idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
[idx2:04064] *** End of error message ***


I am not sure if this points to the actual cause of these issues. Is it to
do with openMPI 1.2.7 having posix enabled in the current configuration
on these nodes?

Thanks again for your continued help.

Regards,

Prasanna.  

  
Message: 2
Date: Thu, 11 Sep 2008 12:16:50 -0400
From: Jeff Squyres <jsquyres@cisco.com>
Subject: Re: [OMPI users] Need help resolving No route to host error
with OpenMPI 1.1.2
To: Open MPI Users <users@open-mpi.org>
Message-ID: <7110E2D0-EB89-4293-A241-8487174B4788@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Sep 10, 2008, at 9:29 PM, Prasanna Ranganathan wrote:

    
I have upgraded to 1.2.7 and am still noticing the issue.
      
FWIW, we didn't change anything with regard to OOB and TCP from 1.2.6
-> 1.2.7, but it's still good to be at the latest version.

Try running with this MCA parameter:

     mpirun --mca oob_tcp_listen_mode listen_thread ...

Sorry; I forgot that we did not enable that option by default in the
v1.2 series.
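
If adding the flag to every command line is awkward, the same parameter can also be supplied through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<parameter name>. A minimal sketch, which should behave the same as the command-line form for mpirun:

```shell
# Same setting as "--mca oob_tcp_listen_mode listen_thread", supplied via the environment.
# Open MPI picks up MCA parameters from variables named OMPI_MCA_<parameter name>.
export OMPI_MCA_oob_tcp_listen_mode=listen_thread
```

A line reading `oob_tcp_listen_mode = listen_thread` in $HOME/.openmpi/mca-params.conf is another option for making it the per-user default.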
    

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users