On Wed, 2012-05-09 at 15:24 +0200, Eduardo Morras wrote:
> Sorry for the delay, and sorry again because in my last mail I had the
> wrong impression that it was some kind of homework problem.
Don't worry ;).
I simplified the core of the problem just to make it easier to
understand (at least that was my intention xD). And I wrote all the
information that I found relevant in the opening post (320 CPUs,
versions of OMPI, operating system and InfiniBand, etc.) precisely
because I wanted to show that it wasn't homework or anything like
that.
> At 17:41 04/05/2012, you wrote:
> > > The logic of send/recv looks ok. Now, in 5 and 7, what values do
> > > recvSize(p2) and recvSize(p1) return?
> >All the sendSizes and RecvSizes are constant between iterations and are
> >calculated as a setup before all the calculations start.
> >Do you know what could cause the program to hang with the default value
> >(310) and to work fine with 305? I also tested it with 311 but it hung,
> >so it seems that it is not enough to activate the SEND flag.
> In your code, the only point where it could fail is if one of the
> precalculated message size values is wrongly calculated and a
> Receive executes where it shouldn't.
Yes, but after the sizes are calculated they don't change, and that's why
I find it weird that it hangs the 30th time the whole communication loop is
executed :S .
> From previous mails I understand that no if(ok!=MPI... line fires
> and there's no Sender waiting. The Ssend ends when the Recv starts to
> receive, not when the Recv finishes receiving, so the sender may get an
> Ok, but if there's an error the Recv stays blocked. As you are using
> blocking communications, you can't do anything to prevent this, for
> example, check the Recv status while receiving.
I don't know how to check the Recv status because the processor remains
waiting for the message at the Recv function.
> Try to use Send instead of Ssend (it should work but it could hang too)
> or change the design to a non-blocking approach.
The problem is that it also hangs with non-blocking communications. The
real program is coded with non-blocking communications and it started to
hang when the size of the mesh got bigger. I just changed to blocking
communications to ease the debugging task.
Now it works, with both blocking and non-blocking communications, just by
changing the value of the MCA parameter btl_openib_flags to 304 or 305
(the default value is 310). That suggests that the problem is with the RDMA
protocols in InfiniBand for large messages. As far as I know, with those
values the flags GET(4) and PUT(2) are deactivated and the protocol for
large messages remains the same as the one for small messages
(send/receive). To me, it seems that there is a bug (probably a memory
leak) in OMPI or OFED.
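
To make the flag arithmetic above easier to follow, here is a minimal
sketch that decodes a btl_openib_flags value. It assumes only the three
bit values mentioned in this thread (SEND=1, PUT=2, GET=4). The names
FLAG_SEND, FLAG_PUT, FLAG_GET and the helper functions are hypothetical,
chosen for illustration, and are not Open MPI identifiers. All higher
bits are left uninterpreted.

```c
#include <stdio.h>

/* Bit values as quoted in this thread (assumption: SEND=1, PUT=2, GET=4).
   These names are illustrative, not Open MPI identifiers. */
enum { FLAG_SEND = 1, FLAG_PUT = 2, FLAG_GET = 4 };

/* Returns 1 if the SEND bit is set, 0 otherwise. */
int send_enabled(int flags)
{
    return (flags & FLAG_SEND) != 0;
}

/* Returns 1 if either RDMA bit (PUT or GET) is set, 0 otherwise. */
int rdma_enabled(int flags)
{
    return (flags & (FLAG_PUT | FLAG_GET)) != 0;
}

/* Prints the three low bits and the remaining, uninterpreted bits. */
void print_flags(int flags)
{
    printf("%d: SEND=%d PUT=%d GET=%d other=%d\n",
           flags,
           (flags & FLAG_SEND) != 0,
           (flags & FLAG_PUT) != 0,
           (flags & FLAG_GET) != 0,
           flags & ~(FLAG_SEND | FLAG_PUT | FLAG_GET));
}
```

Under these assumptions, 310 and 311 both have PUT and GET set (the two
values that hang), while 304 and 305 have both cleared (the two values
that work). The four values differ only in those three low bits, since
the remaining bits add up to 304 in every case.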
Thanks for your help,