Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Collective communications may be abend when it use over 2GiB buffer
From: George Bosilca (bosilca_at_[hidden])
Date: 2012-03-05 10:09:06


Yuki,

I pushed a fix for this issue in the trunk (r26097). However, I disagree with you on some of the topics below.

On Mar 5, 2012, at 04:02 , Y.MATSUMOTO wrote:

> Dear All,
>
> Next feedback is about "collective communications".
>
> Collective communication may abend (terminate abnormally) when it uses a buffer larger than 2 GiB.
> The problem occurs under the following condition:
> -- communicator_size * count (scount/rcount) >= 2 GiB
> It can occur even on a small PC cluster.
>
> The following is one of the suspicious parts.
> (There is a lot of similar code in ompi/coll/tuned/*.c)
>
> --- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk)---
> 398 tmprecv = (char*) rbuf + rank * rcount * rext;
> -----------------------------------------------------------------
>
> If this condition is met, "rank * rcount" overflows.
> So we tentatively fixed it as follows
> (casting int to size_t):
> --- in ompi/coll/tuned/coll_tuned_allgather.c --------------
> 398 tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;
> ------------------------------------------------------------

Based on my understanding of the C standard, this operation should be done in the widest type involved, in this particular case that of rext (ptrdiff_t). Thus I would say the displacement should be computed correctly.
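
A minimal sketch for checking this, assuming a platform with a 32-bit int and a 64-bit size_t. The function names are hypothetical and are not part of Open MPI. In C the `*` operator associates left to right and the usual arithmetic conversions are applied per operator, so in `rank * rcount * rext` the product `rank * rcount` is formed in int before it is converted to the type of rext. The cast in the quoted patch widens the first operand, so both multiplications take place in size_t.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if rank * rcount does not fit in an int,
 * i.e. if the intermediate product in `rank * rcount * rext` would overflow.
 * The check is done in long long so that it does not itself overflow. */
int int_product_overflows(int rank, int rcount)
{
    long long wide = (long long)rank * (long long)rcount;
    return wide > INT_MAX || wide < INT_MIN;
}

/* Hypothetical helper: byte displacement computed as in the quoted patch.
 * Casting the first operand makes every multiplication happen in size_t.
 * Assumes non-negative arguments. */
size_t widened_displacement(int rank, int rcount, ptrdiff_t rext)
{
    return (size_t)rank * rcount * rext;
}
```

For example, with rank = 4, rcount = 2^29 and rext = 4, the int product rank * rcount would be 2^31, one past INT_MAX, whereas the widened form yields 2^33 bytes when size_t is 64 bits wide.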

> Fixing this problem requires changes not only in "ompi/coll/tuned" but also in other code.
> We tried to fix it, but the following functions still have a problem (an argument may overflow):
> -"ompi_coll_tuned_sendrecv" may be called when "scount/rcount" is set to over 2GiB.
> -"ompi_datatype_copy_content_same_ddt" may be called when "count" is set to over 2GiB.

These two should have been fixed by the previous commit (r26097).

> -"basic_linear in Allgather": Bcast may be called when "count" is set to over 2GiB.

Fixed in r26098.

  george.

>
> Best Regards,
> Yuki Matsumoto
> MPI development team,
> Fujitsu