
Open MPI User's Mailing List Archives


Subject: [OMPI users] MPI hangs on multiple nodes
From: Ole Nielsen (ole.moller.nielsen_at_[hidden])
Date: 2011-09-19 20:48:04


Thanks for your suggestion Gus, we need a way of debugging what is going on.
I am pretty sure the problem lies with our cluster configuration. I know MPI
simply relies on the underlying network. However, we can ping and ssh to all
nodes (and between any pair of nodes as well), so it is currently a mystery
why MPI doesn't communicate across nodes on our cluster.
Two further questions for the group

   1. I would love to run the test program connectivity.c, but cannot find
   it anywhere. Can anyone help please?
   2. After having left the job hanging over night we got the message
   [node5][[9454,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
   mca_btl_tcp_frag_recv: readv failed: Connection timed out (110). Does anyone
   know what this means?
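
   (For what it's worth, errno 110 is ETIMEDOUT: the TCP connection that
   Open MPI's tcp BTL opened between two ranks timed out, which usually
   points at a firewall or at the BTL picking an interface the nodes
   cannot actually reach each other on. A common diagnostic step, assuming
   eth0 is an interface on which all nodes can reach one another, is to
   pin the tcp BTL to that one interface:

   mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
          -np 8 -hostfile hosts ./connectivity

   The interface name and host file here are illustrative, not taken from
   Ole's setup.)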

Cheers and thanks
Ole
PS - I don't see how separate buffers would help. Recall that the test
program I use works fine on other installations, and indeed when run on
the cores of a single node.

Message: 11
Date: Mon, 19 Sep 2011 10:37:02 -0400
From: Gus Correa <gus_at_[hidden]>
Subject: Re: [OMPI users] RE : MPI hangs on multiple nodes
To: Open MPI Users <users_at_[hidden]>
Message-ID: <4E77538E.3070007_at_[hidden]>
Content-Type: text/plain; charset=iso-8859-1; format=flowed

Hi Ole

You could try the examples/connectivity.c program in the
OpenMPI source tree, to test if everything is alright.
It also hints how to solve the buffer re-use issue
that Sebastien [rightfully] pointed out [i.e., declare separate
buffers for MPI_Send and MPI_Recv].
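
The separate-buffers pattern can be sketched as below: a minimal ring
exchange in C where each rank sends from one buffer and receives into a
distinct one. This is an illustrative sketch, not the actual
examples/connectivity.c from the source tree, and the variable names are
made up:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* neighbour we send to */
    int left  = (rank + size - 1) % size;   /* neighbour we receive from */

    /* Distinct send and receive buffers: the point Sebastien raised is
       that one buffer must not be reused for a receive while a send
       from it may still be in progress. */
    int sendbuf = rank;
    int recvbuf = -1;

    /* MPI_Sendrecv also avoids the deadlock a naive blocking
       Send-then-Recv around the ring can run into. */
    MPI_Sendrecv(&sendbuf, 1, MPI_INT, right, 0,
                 &recvbuf, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recvbuf, left);
    MPI_Finalize();
    return 0;
}
```

Run with e.g. "mpirun -np 4 ./ring"; every rank should report receiving
its left neighbour's rank.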

Gus Correa