
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-10-18 14:06:24


Yes, that's the normal progression. For some reason, OMPI appears to
have decided that it had not yet received the message. Perhaps a
memory bug in your application...? Have you run it through valgrind,
or some other memory-checking debugger, perchance?
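
For example (assuming a typical mpirun launch; "your_app" below is just a
placeholder for your executable and its arguments), something along these
lines will run every rank under valgrind:

  mpirun -np 2 valgrind --leak-check=full --track-origins=yes ./your_app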

On Oct 18, 2007, at 12:35 PM, Daniel Rozenbaum wrote:

> Unfortunately, so far I haven't even been able to reproduce it on a
> different cluster. Since I had no success getting to the bottom of this
> problem, I've been concentrating my efforts on changing the app so that
> there's no need to send very large messages; I might be able to find
> time later to create a short example that shows the problem.
>
> FWIW, when I was debugging it, I peeked a little into Open MPI code,
> and found that the client's MPI_Recv gets stuck in mca_pml_ob1_recv(),
> after it determines that
> "recvreq->req_recv.req_base.req_ompi.req_complete == false" and calls
> opal_condition_wait().
>
> Jeff Squyres wrote:
>> Can you send a short test program that shows this problem, perchance?
>>
>>
>> On Oct 3, 2007, at 1:41 PM, Daniel Rozenbaum wrote:
>>
>>
>>> Hi again,
>>>
>>> I'm trying to debug the problem I posted about several times recently;
>>> I thought I'd try asking a more focused question:
>>>
>>> I have the following sequence in the client code:
>>>
>>>   MPI_Status stat;
>>>   ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
>>>   assert(ret == MPI_SUCCESS);
>>>   ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
>>>   assert(ret == MPI_SUCCESS);
>>>   char *buffer = malloc(count);
>>>   assert(buffer != NULL);
>>>   ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG,
>>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>   assert(ret == MPI_SUCCESS);
>>>   fprintf(stderr, "MPI_Recv done\n");
>>>   <proceed to taking action on the received buffer, send response to
>>>   server>
>>>
>>> Each MPI_ call in the lines above is surrounded by debug prints
>>> that print out the client's rank, current time, the action about to
>>> be taken with all its parameters' values, and the action's result.
>>> After the first cycle (receive message from server -- process it --
>>> send response -- wait for next message) works out as expected, the
>>> next cycle gets stuck in MPI_Recv. What I get in my debug prints is
>>> more or less the following:
>>>
>>>   MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORLD,
>>>             status= <address1>)
>>>   MPI_Probe done, source= 0, tag= 2, error= 0
>>>   MPI_Get_elements(status= <address1>, dtype= MPI_BYTE, count= <address2>)
>>>   MPI_Get_elements done, count= 2731776
>>>   MPI_Recv(buf= <address3>, count= 2731776, dtype= MPI_BYTE, src= 0,
>>>            tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)
>>>   <nothing beyond this point; some time afterwards there are "readv
>>>   failed" errors in the server's stderr>
>>>
>>> My question, then, is this: what would cause MPI_Recv to not return,
>>> after the immediately preceding MPI_Probe and MPI_Get_elements have
>>> returned properly?
>>>
>>> Thanks,
>>> Daniel
>>>
>
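
For what it's worth, a self-contained test along the lines of the quoted
sequence might look roughly like the sketch below (hypothetical, not your
actual app: it assumes rank 0 sends one large MPI_BYTE message to rank 1,
which probes for its size, allocates a buffer, and receives it). If
something like this hangs the same way on your cluster, it would make a
good short reproducer; if it runs cleanly, that would be consistent with
a memory problem in the application itself.

  /* Run with at least 2 ranks, e.g.: mpirun -np 2 ./probe_recv_test */
  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, count;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Payload size chosen to match the count reported above. */
          int n = 2731776;
          char *payload = malloc(n);
          assert(payload != NULL);
          memset(payload, 0, n);
          MPI_Send(payload, n, MPI_BYTE, 1, 2, MPI_COMM_WORLD);
          free(payload);
      } else if (rank == 1) {
          MPI_Status stat;
          /* Probe for the message size, allocate, then receive. */
          MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
          MPI_Get_elements(&stat, MPI_BYTE, &count);
          char *buffer = malloc(count);
          assert(buffer != NULL);
          MPI_Recv(buffer, count, MPI_BYTE, 0, stat.MPI_TAG,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          fprintf(stderr, "MPI_Recv done, count= %d\n", count);
          free(buffer);
      }

      MPI_Finalize();
      return 0;
  }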

-- 
Jeff Squyres
Cisco Systems