Open MPI User's Mailing List Archives


From: Daniel Rozenbaum (drozenbaum_at_[hidden])
Date: 2007-10-18 14:20:52

Yes, a memory bug has been my primary suspect, given the not entirely
consistent nature of this problem; I've valgrind'ed the app a number of
times, though to no avail. Will post again if anything new comes up...
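In case it matters, the way I've been running it under valgrind is
roughly the following (the process count, binary name and arguments are
placeholders for the real ones):

    mpirun -np 8 valgrind --leak-check=full ./client <args>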

Jeff Squyres wrote:
> Yes, that's the normal progression. For some reason, OMPI appears to
> have decided that it had not yet received the message. Perhaps a
> memory bug in your application...? Have you run it through valgrind,
> or some other memory-checking debugger, perchance?
> On Oct 18, 2007, at 12:35 PM, Daniel Rozenbaum wrote:
>> Unfortunately, so far I haven't even been able to reproduce it on a
>> different cluster. Since I had no success getting to the bottom of
>> this problem, I've been concentrating my efforts on changing the app
>> so that there's no need to send very large messages; I might be able
>> to find time later to create a short example that shows the problem.
>> FWIW, when I was debugging it, I peeked a little into the Open MPI
>> code, and found that the client's MPI_Recv gets stuck in
>> mca_pml_ob1_recv(), after it determines that
>> "recvreq->req_recv.req_base.req_ompi.req_complete == false" and calls
>> opal_condition_wait().
>> Jeff Squyres wrote:
>>> Can you send a short test program that shows this problem, perchance?
>>> On Oct 3, 2007, at 1:41 PM, Daniel Rozenbaum wrote:
>>>> Hi again,
>>>> I'm trying to debug the problem I posted on several times recently;
>>>> I thought I'd try asking a more focused question:
>>>> I have the following sequence in the client code:
>>>> MPI_Status stat;
>>>> int ret, count;
>>>> ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
>>>> assert(ret == MPI_SUCCESS);
>>>> ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
>>>> assert(ret == MPI_SUCCESS);
>>>> char *buffer = malloc(count);
>>>> assert(buffer != NULL);
>>>> ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG,
>>>>                MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>> assert(ret == MPI_SUCCESS);
>>>> fprintf(stderr, "MPI_Recv done\n");
>>>> <proceed to taking action on the received buffer, send response to
>>>> server>
>>>> Each MPI_ call in the lines above is surrounded by debug prints
>>>> that print out the client's rank, current time, the action about to
>>>> be taken with all its parameters' values, and the action's result.
>>>> After the first cycle (receive message from server -- process it --
>>>> send response -- wait for next message) works out as expected, the
>>>> next cycle gets stuck in MPI_Recv. What I get in my debug prints is
>>>> more or less the following:
>>>> MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORLD,
>>>> status= <address1>)
>>>> MPI_Probe done, source= 0, tag= 2, error= 0
>>>> MPI_Get_elements(status= <address1>, dtype= MPI_BYTE, count=
>>>> <address2>)
>>>> MPI_Get_elements done, count= 2731776
>>>> MPI_Recv(buf= <address3>, count= 2731776, dtype= MPI_BYTE, src= 0,
>>>> tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)
>>>> <nothing beyond this point; some time afterwards there are "readv
>>>> failed" errors in the server's stderr>
>>>> My question, then, is this: what would cause MPI_Recv to not
>>>> return, when the immediately preceding MPI_Probe and
>>>> MPI_Get_elements return properly?
>>>> Thanks,
>>>> Daniel