Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Collective communications may be abend when it use over 2GiB buffer
From: George Bosilca (bosilca_at_[hidden])
Date: 2012-03-05 15:42:08

I was worried about all those small intermediate steps. I asked a compiler expert, and apparently reversing the order (i.e., starting with the ptrdiff_t variable) will not solve anything. The only portable way to solve this is to cast every single operand, to prevent __any__ compiler from hurting us.
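
As a rough illustration of the "cast every operand" approach, here is a minimal sketch in C. The helper name block_offset is hypothetical and does not appear in the Open MPI source; the names rank, rcount, and rext follow the allgather line quoted below. It assumes a platform where ptrdiff_t is 64 bits wide; with a 32-bit ptrdiff_t the product could still overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Open MPI): byte offset of the receive
 * block that belongs to `rank`. Every operand is converted to ptrdiff_t
 * before any multiplication, so no intermediate product is computed in int. */
static ptrdiff_t block_offset(int rank, int rcount, ptrdiff_t rext)
{
    /* Assumption for this sketch: ranks and counts are never negative. */
    assert(rank >= 0 && rcount >= 0);
    return (ptrdiff_t)rank * (ptrdiff_t)rcount * (ptrdiff_t)rext;
}
```

With such a helper, the quoted line could be written as tmprecv = (char*) rbuf + block_offset(rank, rcount, rext); whether a helper or inline casts is preferable is a style decision for the code base.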


On Mar 5, 2012, at 13:40 , Larry Baker wrote:

> George,
> I think Yuki's interpretation is correct.
>>> The following is one of the suspicious parts.
>>> (There is a lot of similar code in ompi/coll/tuned/*.c)
>>> --- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk)---
>>> 398 tmprecv = (char*) rbuf + rank * rcount * rext;
>>> -----------------------------------------------------------------
>>> If this condition is met, "rank * rcount" overflows.
>>> So we tentatively fixed it as follows:
>>> (cast int to size_t)
>>> --- in ompi/coll/tuned/coll_tuned_allgather.c --------------
>>> 398 tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;
>>> ------------------------------------------------------------
>> Based on my understanding of the C standard, this operation should be performed in the widest type involved, in this particular case that of rext (ptrdiff_t). Thus I would say the displacement should be computed correctly.
> In my copy of C99, section 6.5 Expressions says "the order of evaluation of subexpressions and the order in which side effects take place are both unspecified." Footnote 71 says that the syntax "specifies the precedence of operators in the evaluation of an expression, which is the same as the order of the major subclauses of this subclause, highest precedence first." It is the footnote that implies multiplication (6.5.5 Multiplicative operators) has higher precedence than addition (6.5.6 Additive operators) in the expression "(char*) rbuf + rank * rcount * rext". But the main text states that there is no ordering within the subexpression "rank * rcount * rext". When the compiler chooses to evaluate "rank * rcount" first, the overflow described by Yuki can result. I think you are correct that the subexpression will get promoted to (ptrdiff_t), but that is not quite the same thing.
> Larry Baker
> US Geological Survey
> 650-329-5608
> baker_at_[hidden]
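
To make the overflow discussed in this thread concrete: when "rank * rcount" is evaluated on its own, it is an int-by-int multiplication, and only its result is converted to ptrdiff_t for the multiplication by rext. The following minimal sketch checks whether that int product would leave the range of int. The function name is hypothetical, and the sketch assumes int is narrower than long long (true on common 32- and 64-bit platforms), so the check itself is done in long long and cannot overflow.

```c
#include <limits.h>

/* Hypothetical check (not part of Open MPI): returns 1 if the product
 * rank * rcount does not fit in an int, 0 otherwise. The product is
 * computed in long long, assumed here to be wider than int. */
static int int_product_overflows(int rank, int rcount)
{
    long long wide = (long long)rank * (long long)rcount;
    return wide > INT_MAX || wide < INT_MIN;
}
```

For example, with rank = 3 and rcount = 1 << 30, the function reports an overflow on a platform with 32-bit int, even though the final byte offset for a 4-byte extent would fit comfortably in a 64-bit ptrdiff_t.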