Here's some more info on the problem I've been struggling with; my
apologies for the lengthy posts, but I'm a little desperate here :-)
I was able to reduce the size of the experiment that reproduces the
problem, both in terms of input data size and the number of slots in
the cluster. The cluster now consists of 6 slots (5 clients), with two
of the clients running on the same node as the server and three others
on another node. This allowed me to follow Brian's
advice and run the server and all the clients under gdb and make
sure none of the processes terminates (normally or abnormally) when the
server reports the "readv failed" errors; this is indeed the case.
I then followed Jeff's advice and added a debug loop just prior to the
server calling MPI_Waitany(), identifying the entries in the requests
array which are not MPI_REQUEST_NULL, and then tracing back these
requests. What I found was the following:
At some point during the run, the server calls MPI_Waitany() on an
array of requests consisting of 96 elements, and gets stuck in it
forever; the only thing that happens at some point thereafter is that
the server reports a couple of "readv failed" errors:
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
According to my debug prints, just before that last call to
MPI_Waitany() the array requests[] contains 38 entries which are not
MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(),
half to Irecv(). Specifically, for example, entries
4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to
client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the
same client.
I traced back what's going on, for instance, with requests[4]. As I
mentioned, it corresponds to a call to MPI_Isend() initiated by the
server to client #3 (of 5). By the time the server gets stuck in
Waitany(), this client has already correctly processed the first
Isend() from the server in requests[4], returned its response in
requests[5], and the server received this response properly. After
receiving this response, the server Isend()'s the next task to this
client in requests[4], and this is correctly reflected in "requests[4]
!= MPI_REQUEST_NULL" just before the last call to Waitany(), but for
some reason this send doesn't seem to go any further.
Looking at all other requests[] corresponding to Isend()'s initiated by
the server to the same client (14,24,...,94), they're all also not
MPI_REQUEST_NULL, and are not going any further either.
One thing that might be important is that the messages the server is
sending to the clients in my experiment are quite large, ranging from
hundreds of Kbytes to several Mbytes, the largest being around 9
Mbytes. The largest messages are sent at the beginning of the run and
are processed correctly, though.
Also, I ran the same experiment on another cluster that uses slightly
different hardware and network infrastructure, and could not reproduce
the problem.
Hope at least some of the above makes some sense. Any additional advice
would be greatly appreciated!
Many thanks,
Daniel
Daniel Rozenbaum wrote:
I'm now running the same experiment under valgrind. It's probably
going to run for a few days, but interestingly what I'm seeing now is
that while running under valgrind's memcheck, the app has been
reporting many more of these "readv failed" errors, and not only on the
server node:
[host1][0,1,0]
[host4][0,1,13]
[host5][0,1,18]
[host8][0,1,30]
[host10][0,1,36]
[host12][0,1,46]
If in the original run I got 3 such messages, in the valgrind'ed run I
got about 45 so far, and the app still has about 75% of the work left.
I'm checking while all this is happening, and all the client processes
are still running, none exited early.
I've been analyzing the debug output in my original experiment, and it
does look like the server never receives any new messages from two of
the clients after the "readv failed" messages appear. If my analysis is
correct, these two clients ran on the same host. It might be the case
then that the messages with the next tasks to execute that the server
attempted to send to these two clients never reached them, or were
never sent. Interesting though that there were two additional clients
on the same host, and those seem to have kept working all along, until
the app got stuck.
Once this valgrind experiment is over, I'll proceed to your other
suggestion about the debug loop on the server side checking for any of
the requests the app is waiting for being MPI_REQUEST_NULL.
Many thanks,
Daniel
Jeff Squyres wrote:
On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote:
What seems to be happening is this: the code of the server is written
in such a manner that the server knows how many "responses" it's
supposed to receive from all the clients, so when all the calculation
tasks have been distributed, the server enters a loop inside which it
calls MPI_Waitany on an array of handles until it receives all the
results it expects. However, from my debug prints it looks like all the
clients think they've sent all the results they could, and they're now
all sitting in MPI_Probe, waiting for the server to send out the next
instruction (which is supposed to contain a message indicating the end
of the run). So, the server is stuck in MPI_Waitany() while all the
clients are stuck in MPI_Probe().
On the server side, try putting in a debug loop and see if any of the
requests that your app is waiting for are not MPI_REQUEST_NULL (it's
not a value of 0 -- you'll need to compare against
MPI_REQUEST_NULL). If there are any, see if you can trace backwards
to see what request it is.
I was wondering if you could comment on the "readv failed" messages I'm
seeing in the server's stderr:
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
I'm seeing a few of these over the course of the server's run, with
errno=110 ("Connection timed out" according to the
"perl -e 'die$!=errno'" method I found in OpenMPI FAQs), and I've also
seen errno=113 ("No route to host"). Could this mean there's an
occasional infrastructure problem? It would be strange, as it would
then seem that this particular run somehow triggers it? Could these
messages also mean that some messages got lost due to these errors, and
that's why the server thinks it still has some results to receive while
the clients think they've sent everything out?
That is all possible. Sorry I missed that error in your original
message -- it's basically saying that MPI_COMM_WORLD rank 0 got a
timeout from one of its peers that it shouldn't have.
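For reference, the errno values quoted above (110 and 113) can also be
decoded from C instead of the perl one-liner. The following is only a
minimal sketch: describe_errno() is a hypothetical helper name, not
part of Open MPI, and errno numbering and message wording are
platform-specific, so the text it produces depends on the system it
runs on.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI function): copy the local C
 * library's description of an errno value into buf.
 * Returns 0 on success, -1 if buf is NULL or has zero length. */
static int describe_errno(int err, char *buf, size_t len)
{
    if (buf == NULL || len == 0)
        return -1;

    /* strerror() maps an errno number to a human-readable string;
     * the mapping is defined by the platform, not by MPI. */
    const char *msg = strerror(err);
    snprintf(buf, len, "%s", msg != NULL ? msg : "unknown error");
    return 0;
}
```

Calling describe_errno(110, buf, sizeof buf) and describe_errno(113,
buf, sizeof buf) on the node that printed the "readv failed" lines
should give the same descriptions as the perl method, since both
consult the same system error table.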
You're sure that none of your processes are exiting early, right?
You said they were all waiting in MPI_Probe, but I just wanted to
double check that they're all still running.
Unfortunately, our error message is not very clear about which host
it lost the connection with -- after you see that message, do you see
incoming communications from all the slaves, or only some of them?