On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote:
> What seems to be happening is this: the code of the server is written
> in such a manner that the server knows how many "responses" it's
> supposed to receive from all the clients, so when all the calculation
> tasks have been distributed, the server enters a loop inside which it
> calls MPI_Waitany on an array of handles until it receives all the
> results it
> expects. However, from my debug prints it looks like all the clients
> think they've sent all the results they could, and they're now all
> sitting in MPI_Probe, waiting for the server to send out the next
> instruction (which is supposed to contain a message indicating the end
> of the run). So, the server is stuck in MPI_Waitany() while all the
> clients are stuck in MPI_Probe().
On the server side, try putting in a debug loop to see whether any of
the requests that your app is waiting for are not MPI_REQUEST_NULL
(MPI_REQUEST_NULL is not necessarily 0 -- you'll need to compare
against the MPI_REQUEST_NULL constant itself). If there are any, see
if you can trace backwards to find out which request each one is.
> I was wondering if you could comment on the "readv failed" messages
> I'm seeing in the server's stderr:
> mca_btl_tcp_frag_recv: readv failed with errno=110
> I'm seeing a few of these along the server's run, with errno=110
> ("Connection timed out" according to the "perl -e 'die$!=errno'"
> I found in OpenMPI FAQs), and I've also seen errno=113 ("No route to
> host"). Could this mean there's an occasional infrastructure problem?
> It would be strange, as it would then seem that this particular run
> triggers it. Could these messages also mean that some messages got
> lost due to these errors, and that's why the server thinks it still
> has some results to receive while the clients think they've sent
> everything out?
That is all possible. Sorry I missed that error in your original
message -- it basically says that MPI_COMM_WORLD rank 0 got a timeout
on its connection to one of its peers, which should not have happened.
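
For reference, the errno values can also be decoded from C with the
standard strerror() call. This is only a sketch: describe_errno is a
made-up wrapper name, and the mapping of 110 to ETIMEDOUT and 113 to
EHOSTUNREACH is the Linux one, so other platforms may number them
differently:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: return the system's text for an errno value. */
const char *describe_errno(int err)
{
    return strerror(err);
}

/* Print the two values reported in the thread, using the symbolic
 * names rather than the raw Linux numbers. */
void print_reported_errors(void)
{
    printf("ETIMEDOUT    (%d): %s\n", ETIMEDOUT, describe_errno(ETIMEDOUT));
    printf("EHOSTUNREACH (%d): %s\n", EHOSTUNREACH, describe_errno(EHOSTUNREACH));
}
```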
You're sure that none of your processes are exiting early, right?
You said they were all waiting in MPI_Probe, but I just wanted to
double check that they're all still running.
Unfortunately, our error message is not very clear about which host
it lost the connection with -- after you see that message, do you
still see incoming communications from all the clients, or only from
some of them?
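
One way to check this is to tally results per sending rank on the
server. The sketch below is plain bookkeeping with made-up names
(record_result, report_silent_ranks); it assumes the source rank is
taken from status.MPI_SOURCE after each MPI_Waitany completes, and
that the clients occupy ranks first_client through nranks - 1:

```c
#include <stdio.h>

/* Hypothetical helper: note one completed receive from `source`.
 * Out-of-range ranks are ignored rather than written out of bounds. */
void record_result(int *tally, int nranks, int source)
{
    if (source >= 0 && source < nranks)
        tally[source]++;
}

/* Hypothetical helper: print every client rank that has sent nothing
 * since the tally was last cleared, and return how many there are. */
int report_silent_ranks(const int *tally, int nranks, int first_client)
{
    int silent = 0;
    for (int r = first_client; r < nranks; ++r) {
        if (tally[r] == 0) {
            fprintf(stderr, "no results received from rank %d\n", r);
            ++silent;
        }
    }
    return silent;
}
```

Clearing the tally when the "readv failed" message shows up and
reporting again a little later should show whether one particular
peer has gone quiet.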