Many thanks for the informative explanation, Jeff. I appreciate this
issue has been the cause of some deliberation!
> This was the changeset where we did the ABI fixes -- ensuring that if
> you compile/link against Open MPI vA.B.C, you will be able to just
> change your LD_LIBRARY_PATH or replace libmpi.so when upgrading to Open
> MPI vA.B.x.
> But -- to make a long
> story short -- the linker will embed the length of the
> ompi_communicator_t in the hello executable. Among other reasons, it's
> so that pointer math can be computed properly.
The goal is admirable and a stalwart of good open source practice which
avoids "DLL-Hell". However, I simply don't understand how my compiler
can *ever* know the size of your opaque ompi_communicator_t?
mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct
ompi_communicator_t", ompi_mpi_comm_world is an external global of
as-yet-undefined "struct ompi_predefined_communicator_t". Then the
compiler must try and work out whether the "comm_predefined_t *" aliases
the "comm_t *". This isn't possible without full type information. In
general, only (void *) and (char *) can alias arbitrary types.
for an excellent discussion on aliasing.
> However, let's assume
> the user upgrades to OMPI vA.B.(C+1) in place, meaning that libmpi.so is
> replaced with a new version. If we've changed the back-end
> ompi_communicator_t struct (e.g., added a member), then the size that is
> included in the hello executable no longer matches the size that is in
> libmpi.so -- and Bad Things can (and do) happen.
I don't get how the linker can affect the actual object code generated
for my hello.o, except via relocation of symbols. If none of my code
makes use of, or even *knows* about your internal struct definitions,
and only holds pointers to them, then I can't do any pointer arithmetic
or anything that requires internal information; I get: "invalid use of
undefined type struct XXXX'" or similar.
I can hold a simple pointer to your private object, but this should
maintain ABI compatibility; isn't that what the PIMPL C++ idiom is all
about and widely used (e.g. Qt Toolkit). The pointer-to-global gets
relocated at shared library load time.
> Hence, the solution of r20627: make a "dummy" type that is guaranteed to
> be larger than ompi_communicator_t -- ompi_predefined_communicator_t.
> It's actually a struct that *contains* ompi_communicator_t and then a
> fixed amount of padding at the end. Since MPI_COMM_WORLD will always be
> this ompi_predefined_communicator_t, we can ensure that its size stays
> constant, even if the size of ompi_communicator_t changes. Hence, the
> size in the hello executable and libmpi.so will be the same. Happiness.
With respect, this feels a little bit like a hack. Who's to say your
future communicator won't need to be even bigger than the current
padding allows? - and then the assumptions made when linking against the
old predefined_t would no longer apply leading to corruption.
>> The compiler complains because ompi_mpi_comm_world is declared as an
>> "extern struct ompi_predefined_communicator_t" but the type is
>> incomplete, so it can't tell whether the cast is a permissible
>> almost-the-same type pun (e.g. an "int" can alias an "unsigned").
>> I think this is potentially a serious performance issue for anyone using
>> OpenMPI in a C++ environment, and the profuse warnings preclude it's use
>> in our build system.
> I agree that the warnings are Bad.
> The question is -- why do they only show up in gcc 4.1? More
> specifically -- why do they *not* show up in other versions of gcc? Is
> it a gcc 4.1 compiler bug?
Older GCC weren't as strict about aliasing issues. Perhaps the latter
ones recognise that in this context (a function call with pointer to
non-local parameter), no optimisation would be possible anyway.
>> The bad news is that the only work around I have is to insert (void *)
>> casts between (MPI_TYPENAME) and the address operator, e.g.:
>> #define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
> Hmm. I guess that's plausible, but ugly.
Actually, I think it's closer to what you've been arguing above: you're
insisting that the compiler blindly interpret &ompi_mpi_comm_world as a
pointer to some memory that really is equivalent to the other unknown
type communicator_t. Without the intermediate (void *), you're
suggesting the compiler could possibly do better optimization.
Ultimately, for internal use, the (void *) is bad, but from client code
with no knowledge of your types, it should be mandatory and tells the
compiler to make no assumptions about aliasing.
>> An alternative might be to make the full type definition available by
>> #including some of the internal developer headers such as
> I'm not sure what the right solution is, but that is not attractive. :-)
Completely agree (on both points!).
In summary, I still don't see why holding pointers to opaque types is
not ABI-compatible, and would suggest the (void *) compiler hint in the