Open MPI User's Mailing List Archives

From: Daniel Rozenbaum (drozenbaum_at_[hidden])
Date: 2007-09-28 17:12:31

Good Open MPI gurus,

I've further reduced the size of the experiment that reproduces the problem. My array of requests now has just 10 entries. By the time the server gets stuck in MPI_Waitany() and three of the clients are stuck in MPI_Recv(), the array holds three unprocessed Isend()s and three unprocessed Irecv()s.

I've upgraded to Open MPI 1.2.4, but this made no difference.

Are there any internal logging or debugging facilities in Open MPI that would allow me to further track the calls that eventually result in the error in mca_btl_tcp_frag_recv()?


Daniel Rozenbaum wrote:
Here's some more info on the problem I've been struggling with; my apologies for the lengthy posts, but I'm a little desperate here :-)

I was able to reduce the size of the experiment that reproduces the problem, both in terms of input data size and the number of slots in the cluster. The cluster now consists of 6 slots (5 clients), with two of the clients running on the same node as the server and three others on another node. This allowed me to follow Brian's advice and run the server and all the clients under gdb and make sure none of the processes terminates (normally or abnormally) when the server reports the "readv failed" errors; this is indeed the case.

I then followed Jeff's advice and added a debug loop just prior to the server calling MPI_Waitany(), identifying the entries in the requests array which are not MPI_REQUEST_NULL, and then tracing back these requests. What I found was the following:
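A debug loop of the kind described above might look roughly like the following. This is a minimal sketch, not the code from the actual application: the helper name `list_pending`, the `req_t` alias, and the `USE_MPI` switch are all hypothetical. The stand-in branch exists only so the sketch compiles on a machine without an MPI installation.

```c
#include <assert.h>
#include <stdio.h>

#ifdef USE_MPI
#include <mpi.h>
typedef MPI_Request req_t;
#define REQ_NULL MPI_REQUEST_NULL
#else
/* Stand-in types (an assumption) so the sketch compiles without MPI. */
typedef int req_t;
#define REQ_NULL (-1)
#endif

/* Hypothetical helper: reports every entry that is not REQ_NULL.
 * Stores up to max such indices in pending[] (may be NULL),
 * prints each index to stderr, and returns the total count. */
int list_pending(const req_t *reqs, int n, int *pending, int max)
{
    assert(reqs != NULL && n >= 0);
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (reqs[i] != REQ_NULL) {
            if (pending != NULL && count < max)
                pending[count] = i;
            fprintf(stderr, "requests[%d] is not the null request\n", i);
            count++;
        }
    }
    return count;
}
```

Built with `-DUSE_MPI`, a call such as `list_pending(requests, 96, NULL, 0)` placed immediately before `MPI_Waitany()` would print the indices that still need to be traced back to their Isend()/Irecv() calls.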

At some point during the run, the server calls MPI_Waitany() on an array of requests consisting of 96 elements, and gets stuck in it forever; the only thing that happens at some point thereafter is that the server reports a couple of "readv failed" errors:

[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
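
For context on the errno value in these messages: on Linux, errno 110 is ETIMEDOUT ("Connection timed out"). That would be consistent with the TCP connection timing out while both processes are still alive. A small sketch for decoding such values on the host where the error was reported follows; the helper names are hypothetical, and the numeric value of ETIMEDOUT differs on some other platforms.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if err equals this platform's ETIMEDOUT, 0 otherwise. */
int is_timeout_errno(int err)
{
    return err == ETIMEDOUT;
}

/* Formats "errno=<n> (<system message>)" into buf.
 * Returns the value returned by snprintf. */
int format_errno(int err, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "errno=%d (%s)", err, strerror(err));
}
```

Calling `format_errno(110, buf, sizeof buf)` on the affected host shows the system's own description of the error.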

According to my debug prints, just before that last call to MPI_Waitany() the array requests[] contains 38 entries which are not MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(), half to Irecv(). Specifically, for example, entries 4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the same client.
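
The index pattern quoted above (4,14,...,94 for Isend() and 5,15,...,95 for Irecv(), both for client #3) is consistent with a layout in which each of the 5 clients owns two adjacent entries in every block of 10. The sketch below decodes an index under that assumption. It is inferred from the numbers in this message only; the real application may use a different scheme (a 96-element array would leave the last block incomplete under this layout), and all names are hypothetical.

```c
#include <assert.h>

#define NUM_CLIENTS        5
#define ENTRIES_PER_CLIENT 2   /* one Isend() entry, one Irecv() entry */
#define BLOCK_SIZE         (NUM_CLIENTS * ENTRIES_PER_CLIENT)

typedef struct {
    int client;   /* 1-based client number */
    int slot;     /* which block of 10 the entry belongs to */
    int is_recv;  /* 0 = Isend() entry, 1 = Irecv() entry */
} req_owner;

/* Maps an index into requests[] to its assumed owner. */
req_owner decode_request_index(int idx)
{
    assert(idx >= 0);
    req_owner o;
    o.slot    = idx / BLOCK_SIZE;
    o.client  = (idx % BLOCK_SIZE) / ENTRIES_PER_CLIENT + 1;
    o.is_recv = idx % ENTRIES_PER_CLIENT;
    return o;
}
```

Under this assumed layout, index 4 decodes to client 3, slot 0, Isend(), and index 95 decodes to client 3, slot 9, Irecv(), matching the entries listed above.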

I traced back what's going on, for instance, with requests[4]. As I mentioned, it corresponds to a call to MPI_Isend() initiated by the server to client #3 (of 5). By the time the server gets stuck in Waitany(), this client has already correctly processed the first Isend() from the server in requests[4] and returned its response in requests[5], and the server has received this response properly. After receiving this response, the server Isend()s the next task to this client in requests[4], and this is correctly reflected in "requests[4] != MPI_REQUEST_NULL" just before the last call to Waitany(), but for some reason this send doesn't seem to go any further.

Looking at all other requests[] corresponding to Isend()'s initiated by the server to the same client (14,24,...,94), they're all also not MPI_REQUEST_NULL, and are not going any further either.

One thing that might be important is that the messages the server sends to the clients in my experiment are quite large, ranging from hundreds of Kbytes to several Mbytes, the largest being around 9 Mbytes. The largest messages are sent at the beginning of the run, though, and are processed correctly.

Also, I ran the same experiment on another cluster that uses slightly different hardware and network infrastructure, and could not reproduce the problem.

Hope at least some of the above makes some sense. Any additional advice would be greatly appreciated!
Many thanks,