Jeff Squyres wrote:
> On Apr 24, 2009, at 1:24 PM, Number Cruncher wrote:
>> The goal is admirable and a stalwart of good open source practice which
>> avoids "DLL-Hell". However, I simply don't understand how my compiler
>> can *ever* know the size of your opaque ompi_communicator_t?
> I'm not enough of a linker expert to explain why, but before that
> commit, the linker did embed the size of ompi_communicator_t into your
The problem is when you compile your application the handles like
MPI_COMM_WORLD result in undefined symbols in the .o. So when you link
the application it needs to resolve all those symbols. In this case it
resolves them to what libmpi.so has them defined as which includes
location and size. Since these are accessible by the application the
structures are located in bss which means it overrides any other library
location of such a variable. So now we have a handle that size and
location is predefined after link time preventing updating either during
runtime by linking in a different library.
> A simple way to verify this is to do the following:
> - compile and link an MPI "hello world" application against Open MPI
> - run the app, verify that it runs properly
> - change your LD_LIBRARY_PATH and have your "hello world" app find
> Open MPI 1.3 / 1.3.1's libmpi.so
> - run your app
> - notice a bunch of messages something like the following (off the top
> of my head; I don't remember the exact message):
> ompi_mpi_comm_world size difference between hello and libmpi.so;
> consider re-linking
> So it's not the *compiler* that creates the problem -- who, I agree,
> does not know the definition of ompi_comnunicator_t. It's the
> *linker*, that knows the *size* of these objects (not necessarily the
> You can simulate this problem with a much smaller non-MPI test code,
> too. I'll admit that this problem kind of blind-sided us -- we never
> expected it because we initially had the same thoughts that you did:
> they're just pointer values that are resolved at run-time; why does
> the size of the back-end struct matter? Unfortunately, it does. :-(
>> mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct
>> ompi_communicator_t", ompi_mpi_comm_world is an external global of
>> as-yet-undefined "struct ompi_predefined_communicator_t". Then the
>> compiler must try and work out whether the "comm_predefined_t *" aliases
>> the "comm_t *". This isn't possible without full type information. In
>> general, only (void *) and (char *) can alias arbitrary types.
>> for an excellent discussion on aliasing.
>> > Hence, the solution of r20627: make a "dummy" type that is
>> guaranteed to
>> > be larger than ompi_communicator_t -- ompi_predefined_communicator_t.
>> > It's actually a struct that *contains* ompi_communicator_t and then a
>> > fixed amount of padding at the end. Since MPI_COMM_WORLD will
>> always be
>> > this ompi_predefined_communicator_t, we can ensure that its size stays
>> > constant, even if the size of ompi_communicator_t changes. Hence, the
>> > size in the hello executable and libmpi.so will be the same.
>> With respect, this feels a little bit like a hack. Who's to say your
>> future communicator won't need to be even bigger than the current
>> padding allows? - and then the assumptions made when linking against the
>> old predefined_t would no longer apply leading to corruption.
> Definitely true. Here's our rationale:
> - As Terry mentioned, MPI requires that you can do stuff like:
> MPI_Comm foo = MPI_COMM_WORLD;
> before calling MPI_INIT. This is where much of the evil comes from;
> MPI_COMM_WORLD has to be a compile-time constant; it can't be a global
> variable that is assigned a meaningful value during MPI_INIT (for
> - The padding is gross, and we could certainly run out of space
> someday. However, this concern is mitigated by three points:
> 1. We tried to give ourselves enough padding that we shouldn't run
> out of space (admittedly, we can't predict the future, though, so this
> may still fail someday)
> 2. We probably didn't spell out the conditions well enough in our
> current documentation, but I believe that the ABI guarantees really
> only apply to the "stable" series -- e.g., 1.4.x. However, we know
> the features that are going into 1.3.x (which will then morph into the
> 1.4.x series), and we don't anticipate changing the sizes of our
> structures in 1.3.x and 1.4.x. 1.5.x is a different issue; we don't
> know yet if the structs will change size in that series -- and the ABI
> is ok to change at 1.5 (it would be nice if it does not, but I don't
> know if we've taken a hard-line guarantee that an app compiled for
> 1.3.x will work with 1.5.x, for example).
> 3. As Terry mentioned, if we do grow beyond the current padding,
> there is the gross hack of adding a pointer in the current padding to
> point to the rest of the struct. Gross, yes. Workable, yes. It's
> our failsafe in case we *have* to increase the size of a struct in the
> 1.3/1.4 series, but can't violate the ABI.
>> > The question is -- why do they only show up in gcc 4.1? More
>> > specifically -- why do they *not* show up in other versions of
>> gcc? Is
>> > it a gcc 4.1 compiler bug?
>> Older GCC weren't as strict about aliasing issues. Perhaps the latter
>> ones recognise that in this context (a function call with pointer to
>> non-local parameter), no optimisation would be possible anyway.
> That sounds plausible.
>> >> The bad news is that the only work around I have is to insert
>> (void *)
>> >> casts between (MPI_TYPENAME) and the address operator, e.g.:
>> >> #define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
>> > Hmm. I guess that's plausible, but ugly.
>> Actually, I think it's closer to what you've been arguing above: you're
>> insisting that the compiler blindly interpret &ompi_mpi_comm_world as a
>> pointer to some memory that really is equivalent to the other unknown
>> type communicator_t. Without the intermediate (void *), you're
>> suggesting the compiler could possibly do better optimization.
>> Ultimately, for internal use, the (void *) is bad, but from client code
>> with no knowledge of your types, it should be mandatory and tells the
>> compiler to make no assumptions about aliasing.
> I think you're convincing me, but I need to think over this a little
> I wonder if we should have a "smart" #define for MPI_COMM_WORLD that
> puts the (void*) in when compiling user codes, and doesn't include it
> when building OMPI itself (i.e., so that we can get the debugging
> benefit of proper type checking when building OMPI). We do have the
> OMPI_BUILDING macro in opal_config_bottom.h that could be used to
> switch the macro definition. Hrm...