On Thu, 29 Jun 2006, Jeff Squyres (jsquyres) wrote:
> I think you may have caught us in an unintentional breakage. If your
> Open MPI was compiled as shared libraries and dynamic shared objects (the
> default), this error should not have happened since we did not change
Sure, I simply use the default.
> So there must be a second-order effect going on here (somehow the
> size of a back-end data structure caused a problem. Hrm.).
> We'll look into this, because where all of OMPI's functionality is
> in shared libraries and components, the user's application should be
> isolated from internal changes like this (i.e., we can provide
> forward compatibility).
> I suspect that something deeper is going on, so let us go investigate
> and come back with a more definitive statement.
Well, following the warnings, I checked the sizes of the ompi_mpi_comm_null
and ompi_mpi_comm_world symbols in both the library and the executable
with objdump -T:
OpenMPI 1.1 library:
00000000001e8140 g DO .bss 00000000000001c8 Base ompi_mpi_comm_null
00000000001e83a0 g DO .bss 00000000000001c8 Base ompi_mpi_comm_world
OpenMPI 1.0.2 executable:
000000000058f3d0 g DO .bss 00000000000001c0 ompi_mpi_comm_world
000000000058ef00 g DO .bss 00000000000001c0 ompi_mpi_comm_null
So the size has indeed changed. Now, MPI_COMM_WORLD is an opaque
pointer, so if the internal data structure changes, this should have no
effect on the functioning of the executable.
However, note that the ompi_mpi_comm_* symbols are not merely referenced
in the 1.0.2 executable, but declared there, with a size of their own! The
most likely cause is that they were declared in the assembler file using
.comm.
The dynamic linker will merge both declarations. Merging two symbols with
different sizes is hard, so the linker has to make a choice. Suppose it
chooses the declaration in the executable. Then the image in memory will
contain ompi_mpi_comm_* data structures of 0x1c0 bytes, while the library
expects them to be 0x1c8 bytes, i.e. 8 bytes larger.
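
To make the mismatch concrete, here is a minimal sketch in C. Only the two
sizes (0x1c0 and 0x1c8) come from the objdump output above; the struct names,
the function names and the "neighbour" buffer are hypothetical stand-ins, not
Open MPI code. It only illustrates what could happen if the library
initialises an object in space that was reserved with the smaller size.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two layouts. Only the sizes are taken
   from the objdump output; the real structures have named fields. */
struct comm_v10 { unsigned char bytes[0x1c0]; };  /* size seen in the 1.0.2 executable */
struct comm_v11 { unsigned char bytes[0x1c8]; };  /* size seen in the 1.1 library */

/* Bytes the newer library would touch beyond the space reserved
   by the older executable. */
size_t overrun_bytes(void)
{
    return sizeof(struct comm_v11) - sizeof(struct comm_v10);
}

/* Simulate the library zeroing "its" object inside a buffer whose first
   0x1c0 bytes play the role of the executable's reservation and whose
   remaining bytes play the role of an unrelated neighbouring variable.
   Returns 1 if the neighbour was overwritten. */
int neighbour_clobbered(void)
{
    unsigned char image[sizeof(struct comm_v11)];

    memset(image, 0xAA, sizeof image);           /* 0xAA marks the neighbour's data */
    memset(image, 0, sizeof(struct comm_v11));   /* library writes 0x1c8 bytes */

    /* Inspect the first byte past the executable's 0x1c0-byte reservation. */
    return image[sizeof(struct comm_v10)] != 0xAA;
}
```

In this sketch the library-sized write runs 8 bytes past the reservation, so
whatever happens to live right after the symbol in the executable's .bss
would be overwritten.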
Conclusion: opaque pointers should not be declared with .comm; they should
only be referenced.
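
As a rough illustration of what "only referenced" could look like at the C
level, here is a sketch with made-up names (my_comm, my_comm_world_obj,
MY_COMM_WORLD, my_comm_rank); it is not the Open MPI header. The application
side sees only an incomplete type and an extern reference, so it never needs
to know the object's size. Whether the toolchain then avoids reserving space
in the executable still depends on how the code is compiled and linked, so
take this as an assumption-laden sketch. Both halves are placed in one file
here only so that it compiles on its own.

```c
/* --- what a hypothetical public header would expose --- */
struct my_comm;                            /* incomplete type: size unknown to the application */
typedef struct my_comm *my_comm_handle;
extern struct my_comm my_comm_world_obj;   /* a reference only, no storage reserved here */
#define MY_COMM_WORLD (&my_comm_world_obj)

int my_comm_rank(my_comm_handle c);

/* --- what the hypothetical library would define privately --- */
struct my_comm {
    int rank;
    unsigned char pad[0x1c8 - sizeof(int)];  /* pads the object to 0x1c8 bytes */
};

struct my_comm my_comm_world_obj = { 7, {0} };  /* the one and only definition */

int my_comm_rank(my_comm_handle c)
{
    return c->rank;  /* only library code looks inside the structure */
}
```

With this arrangement the application only ever passes MY_COMM_WORLD around
as a pointer, and the library remains free to change the size of the
structure between releases.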
I haven't given my system details yet: I'm using OpenSuSE 10 on the x86_64
architecture. The compiler does not seem to make any difference: the
result is the same with GCC, Intel C and PathScale.