Thanks for your suggestion, Gus; we need a way of debugging what is
going on. I am pretty sure the problem lies with our cluster
configuration. I know MPI simply relies on the underlying network;
however, we can ping and ssh to all nodes (and between every pair of
nodes as well), so it is currently a mystery why MPI doesn't
communicate across nodes on our cluster.
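For reference, here is roughly how we launch the job; forcing the TCP
BTL and pinning it to a single interface is one variation we could try
while debugging. The interface name eth0, the hostfile name, and the
binary name are just placeholders for our setup:

    # force the TCP BTL (plus self) and restrict it to one interface
    mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
           -np 4 --hostfile myhosts ./mpi_test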
Two further questions for the group:
1. I would love to run the test program connectivity.c, but cannot find
it anywhere. Can anyone help please?
2. After leaving the job hanging overnight we got the message
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110). Does
anyone know what this means?
Cheers and thanks
PS - I don't see how separate buffers would help. Recall that the test
program I use works fine on other installations, and indeed when run on
the cores of a single node.
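For the record, the pattern I understand Sebastien and Gus to be
suggesting looks something like the sketch below, with distinct send
and receive buffers; the two-rank exchange, tag, and message size are
made up for illustration:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double sendbuf[64] = {0};   /* buffer used only by MPI_Send */
        double recvbuf[64];         /* separate buffer for MPI_Recv */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* simple ping-pong; assumes exactly 2 processes */
        if (rank == 0) {
            MPI_Send(sendbuf, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(recvbuf, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(sendbuf, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }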
Date: Mon, 19 Sep 2011 10:37:02 -0400
From: Gus Correa <gus_at_[hidden]>
Subject: Re: [OMPI users] RE : MPI hangs on multiple nodes
To: Open MPI Users <users_at_[hidden]>
You could try the examples/connectivity.c program in the
OpenMPI source tree, to test if everything is alright.
It also hints at how to solve the buffer re-use issue
that Sebastien [rightfully] pointed out [i.e., declare separate
buffers for MPI_Send and MPI_Recv].