Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] predefined ompi_t types break strict-aliasing rules
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-04-24 13:57:46


Number Cruncher wrote:
> Many thanks for the informative explanation, Jeff. I appreciate this
> issue has been the cause of some deliberation!
>
>>
>> This was the changeset where we did the ABI fixes -- ensuring that if
>> you compile/link against Open MPI vA.B.C, you will be able to just
>> change your LD_LIBRARY_PATH or replace libmpi.so when upgrading to
>> Open MPI vA.B.x.
>> But -- to make a long
>> story short -- the linker will embed the length of the
>> ompi_communicator_t in the hello executable. Among other reasons,
>> it's so that pointer math can be computed properly.
>
> The goal is admirable and a stalwart of good open source practice
> which avoids "DLL-Hell". However, I simply don't understand how my
> compiler can *ever* know the size of your opaque ompi_communicator_t?
>
And in this case the client code doesn't need to know the size of the
struct, we are only concerned with the address of the opaque handle so
it can be passed on.
> mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct
> ompi_communicator_t", ompi_mpi_comm_world is an external global of
> as-yet-undefined "struct ompi_predefined_communicator_t". Then the
> compiler must try and work out whether the "comm_predefined_t *"
> aliases the "comm_t *". This isn't possible without full type
> information. In general, only (void *) and (char *) can alias
> arbitrary types.
>
> See
> http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html
> for an excellent discussion on aliasing.
>
>> However, let's assume the user upgrades to OMPI vA.B.(C+1) in place,
>> meaning that libmpi.so is replaced with a new version. If we've
>> changed the back-end ompi_communicator_t struct (e.g., added a
>> member), then the size that is included in the hello executable no
>> longer matches the size that is in libmpi.so -- and Bad Things can
>> (and do) happen.
>
> I don't get how the linker can affect the actual object code generated
> for my hello.o, except via relocation of symbols. If none of my code
> makes use of, or even *knows* about your internal struct definitions,
> and only holds pointers to them, then I can't do any pointer
> arithmetic or anything that requires internal information; I get:
> "invalid use of undefined type ‘struct XXXX'" or similar.
>
> I can hold a simple pointer to your private object, but this should
> maintain ABI compatibility; isn't that what the PIMPL C++ idiom is all
> about and widely used (e.g. Qt Toolkit). The pointer-to-global gets
> relocated at shared library load time.
>
In C++ you can do the above however that is not valid C code when trying
to initialize a global pointer. The initializer must be a compile-time
constant. It was this artifact that made us go down the path of padded
structures.
>
>>
>> Hence, the solution of r20627: make a "dummy" type that is guaranteed
>> to be larger than ompi_communicator_t --
>> ompi_predefined_communicator_t. It's actually a struct that
>> *contains* ompi_communicator_t and then a fixed amount of padding at
>> the end. Since MPI_COMM_WORLD will always be this
>> ompi_predefined_communicator_t, we can ensure that its size stays
>> constant, even if the size of ompi_communicator_t changes. Hence,
>> the size in the hello executable and libmpi.so will be the same.
>> Happiness.
>
> With respect, this feels a little bit like a hack. Who's to say your
> future communicator won't need to be even bigger than the current
> padding allows? - and then the assumptions made when linking against
> the old predefined_t would no longer apply leading to corruption.
However, we've have enough room to store a pointer to an extension to
the structure. So, if we reach to the point that we are near the end of
our pad the last thing we put in a structure should be a pointer to an
extension. Gross but workable. I am actually not that convinced we
will actually reach that point but who knows.
>
>>
>>> The compiler complains because ompi_mpi_comm_world is declared as an
>>> "extern struct ompi_predefined_communicator_t" but the type is
>>> incomplete, so it can't tell whether the cast is a permissible
>>> almost-the-same type pun (e.g. an "int" can alias an "unsigned").
>>>
>>> I think this is potentially a serious performance issue for anyone
>>> using
>>> OpenMPI in a C++ environment, and the profuse warnings preclude it's
>>> use
>>> in our build system.
>>>
>>
>> I agree that the warnings are Bad.
>>
>> The question is -- why do they only show up in gcc 4.1? More
>> specifically -- why do they *not* show up in other versions of gcc?
>> Is it a gcc 4.1 compiler bug?
>
> Older GCC weren't as strict about aliasing issues. Perhaps the latter
> ones recognise that in this context (a function call with pointer to
> non-local parameter), no optimisation would be possible anyway.
>
>>
>>> The bad news is that the only work around I have is to insert (void *)
>>> casts between (MPI_TYPENAME) and the address operator, e.g.:
>>> #define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
>>>
>>
>> Hmm. I guess that's plausible, but ugly.
>
> Actually, I think it's closer to what you've been arguing above:
> you're insisting that the compiler blindly interpret
> &ompi_mpi_comm_world as a pointer to some memory that really is
> equivalent to the other unknown type communicator_t. Without the
> intermediate (void *), you're suggesting the compiler could possibly
> do better optimization.
>
> Ultimately, for internal use, the (void *) is bad, but from client
> code with no knowledge of your types, it should be mandatory and tells
> the compiler to make no assumptions about aliasing.
>
>>
>>> An alternative might be to make the full type definition available by
>>> #including some of the internal developer headers such as
>>> ompi/communicator/communicator.h
>>>
>>
>>
>> I'm not sure what the right solution is, but that is not attractive.
>> :-)
>>
>
> Completely agree (on both points!).
>
> In summary, I still don't see why holding pointers to opaque types is
> not ABI-compatible, and would suggest the (void *) compiler hint in
> the meantime.
>
Just to re-iterate what I said above holding pointers to opaque types
would be ABI-compatible. However, in C you cannot initialize a global
pointer with a non-compile-time constant. Unfortunately, the MPI
specification requires one to be able to assign the opaque handles, like
MPI_COMM_WORLD and such, to a global variable.

--td