Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] predefined ompi_t types break strict-aliasing rules
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-26 09:07:08


On Apr 24, 2009, at 1:24 PM, Number Cruncher wrote:

> The goal is admirable and a stalwart of good open source practice
> which
> avoids "DLL-Hell". However, I simply don't understand how my compiler
> can *ever* know the size of your opaque ompi_communicator_t?
>

I'm not enough of a linker expert to explain why, but before that
commit, the linker did embed the size of ompi_communicator_t into your
executable.

A simple way to verify this is to do the following:

- compile and link an MPI "hello world" application against Open MPI
1.2.x
- run the app, verify that it runs properly
- change your LD_LIBRARY_PATH and have your "hello world" app find
Open MPI 1.3 / 1.3.1's libmpi.so
- run your app
- notice a bunch of messages something like the following (off the top
of my head; I don't remember the exact message):

ompi_mpi_comm_world size difference between hello and libmpi.so;
consider re-linking

So it's not the *compiler* that creates the problem -- who, I agree,
does not know the definition of ompi_comnunicator_t. It's the
*linker*, that knows the *size* of these objects (not necessarily the
definition).

You can simulate this problem with a much smaller non-MPI test code,
too. I'll admit that this problem kind of blind-sided us -- we never
expected it because we initially had the same thoughts that you did:
they're just pointer values that are resolved at run-time; why does
the size of the back-end struct matter? Unfortunately, it does. :-(

> mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct
> ompi_communicator_t", ompi_mpi_comm_world is an external global of
> as-yet-undefined "struct ompi_predefined_communicator_t". Then the
> compiler must try and work out whether the "comm_predefined_t *"
> aliases
> the "comm_t *". This isn't possible without full type information. In
> general, only (void *) and (char *) can alias arbitrary types.
>
> See
> http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html
> for an excellent discussion on aliasing.
>

Understood.

> > Hence, the solution of r20627: make a "dummy" type that is
> guaranteed to
> > be larger than ompi_communicator_t --
> ompi_predefined_communicator_t.
> > It's actually a struct that *contains* ompi_communicator_t and
> then a
> > fixed amount of padding at the end. Since MPI_COMM_WORLD will
> always be
> > this ompi_predefined_communicator_t, we can ensure that its size
> stays
> > constant, even if the size of ompi_communicator_t changes. Hence,
> the
> > size in the hello executable and libmpi.so will be the same.
> Happiness.
>
> With respect, this feels a little bit like a hack. Who's to say your
> future communicator won't need to be even bigger than the current
> padding allows? - and then the assumptions made when linking against
> the
> old predefined_t would no longer apply leading to corruption.
>

Definitely true. Here's our rationale:

- As Terry mentioned, MPI requires that you can do stuff like:

     MPI_Comm foo = MPI_COMM_WORLD;

before calling MPI_INIT. This is where much of the evil comes from;
MPI_COMM_WORLD has to be a compile-time constant; it can't be a global
variable that is assigned a meaningful value during MPI_INIT (for
example).

- The padding is gross, and we could certainly run out of space
someday. However, this concern is mitigated by three points:

   1. We tried to give ourselves enough padding that we shouldn't run
out of space (admittedly, we can't predict the future, though, so this
may still fail someday)

   2. We probably didn't spell out the conditions well enough in our
current documentation, but I believe that the ABI guarantees really
only apply to the "stable" series -- e.g., 1.4.x. However, we know
the features that are going into 1.3.x (which will then morph into the
1.4.x series), and we don't anticipate changing the sizes of our
structures in 1.3.x and 1.4.x. 1.5.x is a different issue; we don't
know yet if the structs will change size in that series -- and the ABI
is ok to change at 1.5 (it would be nice if it does not, but I don't
know if we've taken a hard-line guarantee that an app compiled for
1.3.x will work with 1.5.x, for example).

   3. As Terry mentioned, if we do grow beyond the current padding,
there is the gross hack of adding a pointer in the current padding to
point to the rest of the struct. Gross, yes. Workable, yes. It's
our failsafe in case we *have* to increase the size of a struct in the
1.3/1.4 series, but can't violate the ABI.

> > The question is -- why do they only show up in gcc 4.1? More
> > specifically -- why do they *not* show up in other versions of
> gcc? Is
> > it a gcc 4.1 compiler bug?
>
> Older GCC weren't as strict about aliasing issues. Perhaps the latter
> ones recognise that in this context (a function call with pointer to
> non-local parameter), no optimisation would be possible anyway.
>

That sounds plausible.

> >> The bad news is that the only work around I have is to insert
> (void *)
> >> casts between (MPI_TYPENAME) and the address operator, e.g.:
> >> #define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))
> >
> > Hmm. I guess that's plausible, but ugly.
>
> Actually, I think it's closer to what you've been arguing above:
> you're
> insisting that the compiler blindly interpret &ompi_mpi_comm_world
> as a
> pointer to some memory that really is equivalent to the other unknown
> type communicator_t. Without the intermediate (void *), you're
> suggesting the compiler could possibly do better optimization.
>
> Ultimately, for internal use, the (void *) is bad, but from client
> code
> with no knowledge of your types, it should be mandatory and tells the
> compiler to make no assumptions about aliasing.
>

I think you're convincing me, but I need to think over this a little
more.

I wonder if we should have a "smart" #define for MPI_COMM_WORLD that
puts the (void*) in when compiling user codes, and doesn't include it
when building OMPI itself (i.e., so that we can get the debugging
benefit of proper type checking when building OMPI). We do have the
OMPI_BUILDING macro in opal_config_bottom.h that could be used to
switch the macro definition. Hrm...

-- 
Jeff Squyres
Cisco Systems