Have you gone to those nodes and checked the IP addresses of -all- their interfaces? OMPI must be picking up those addresses from somewhere. My best guess is that those nodes have multiple interfaces, some of which are configured with those addresses.
Remember: we don't look at the /etc/hosts file on the node where mpirun is executed to get the addresses. The processes started on each remote node query the addresses of all available interfaces on that node. The result is frequently different from the address given in your /etc/hosts file.
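To make the point above concrete, here is a rough Python sketch of the check you could do by hand once you have listed the interfaces on each node. It is not Open MPI's actual selection logic: the function name `unreachable_peers` and the 10.0.0.x addresses are hypothetical, and the "same subnet" test is a simplification that ignores routing tables and gateways.

```python
import ipaddress

def unreachable_peers(local_ifaces, peer_addrs):
    """Return the peer addresses that share no subnet with any local interface.

    local_ifaces: interface addresses in CIDR form, e.g. "10.0.0.7/24"
    peer_addrs:   plain addresses that a remote process reports for its node
    """
    # Derive the network each local interface belongs to.
    nets = [ipaddress.ip_interface(i).network for i in local_ifaces]
    # Keep only the peer addresses that fall outside every local network.
    return [a for a in peer_addrs
            if not any(ipaddress.ip_address(a) in n for n in nets)]

# Hypothetical example: this node has one interface on 10.0.0.0/24,
# while a peer reports two addresses, one of them on another network.
local = ["10.0.0.7/24"]
peer = ["10.0.0.1", "192.168.0.5"]
print(unreachable_peers(local, peer))  # ['192.168.0.5']
```

If an address like that shows up in the list, the remote node has an interface you did not expect. In that case the `btl_tcp_if_include` or `btl_tcp_if_exclude` MCA parameters can be used to restrict which interfaces the TCP BTL will use.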
On Jul 10, 2011, at 7:45 PM, zhuangchao wrote:
I ran the following command:
/data1/cluster/openmpi/bin/mpirun -d -machinefile /tmp/nodes.10515.txt -np 3 /data1/cluster/mpiblast-pio-1.6/bin/mpiblast -p blastn -i /data1/cluster/sequences/seq_4.txt -d Baculo_Nucleotide -o /data1/cluster/blast.out/blast.out.10515 -g T -m 0 -F F
Then I got the following error from openmpi:
[node7][[3812,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.0.5 failed: No route to host (113)
[node7][[3812,1],2][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 184.108.40.206 failed: No route to host (113)
The machinefile is defined as follows:
192.168.0.5 is not the IP of any host in the machinefile. 220.127.116.11 is another IP of node1, but the hostname node1 corresponds to 18.104.22.168 in /etc/hosts.
Why do I get this error? Can you help me?
Thank you!