Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Recv hangs
From: Eduardo Morras (nec556_at_[hidden])
Date: 2012-05-09 11:33:12

At 16:19 09/05/2012, you wrote:

> > In your code, the only point where it could fail is if one of the
> > precalculated message sizes is wrongly calculated and a Receive is
> > executed where it shouldn't be.
>Yes, but after the sizes are calculated they don't change and that's why
>I find it weird to hang the 30th time the whole communication loop is
>executed :S .

If in your code you don't use sizeof with MPI Datatypes there should
be no problem :)

> >
> > From previous mails I understand that no if(ok!=MPI... line fires
> > and there's no Sender waiting. The Ssend completes when the Recv starts
> > to receive, not when the Recv finishes receiving, so the sender may get
> > an OK, but if there's an error the Recv stays blocked. As you are using
> > blocking communications, you can't do anything to prevent this, for
> > example, check the Recv status while receiving.
>I don't know how to check the Recv status because the processor remains
>waiting for the message at the Recv function.

That's what I'm pointing out. In blocking mode you can't check that until Recv ends.

> > Try to use Send instead of Ssend (it should work but it could hang too)
> > or change the design to a non-blocking approach.
>The problem is that it also hangs with non-blocking communications. The
>real program is coded with non-blocking communications and it started to
>hang when the size of the mesh got bigger. I just changed to blocking
>communications to ease the debugging task.
>Now it works, with blocking and non-blocking communications, just
>changing the value of the mca parameter btl_openib_flags to 304 or 305
>(the default value is 310). That means that the problem is with the RDMA
>protocols in infiniband for large messages. As far as I know, with those
>values the flags GET(4) and PUT(2) are deactivated and the protocol for
>large messages remains the same as the one for small messages
>(send/receive). To me, it seems that there is a bug (probably a memory
>leak) in OMPI or OFED.

Some memory leaks that affect openib were fixed in 1.4.5; see the release notes.

>Thanks for your help,