Dear Open-MPI Developers,
Our investigation of the segmentation fault (see the previous postings "Signal:
Segmentation fault (11) Problem") leads us to suspect that Open-MPI allows
only a limited number of elements in the description of user-defined
MPI_Datatypes.
Our application segmentation-faults when a large user-defined data structure is
passed to MPI_Send.
The segmentation fault happens in the function ompi_generic_simple_pack in
datatype_pack.c when trying to access pElem (Bad address). The structure pElem
is set in line 276, where it is retrieved as
276: pElem = &(description[pos_desc]);
pos_desc is of type uint32_t and has the value 0xffff929f (4294939295); it is
set on line 271 from a variable of type int16_t with value -1. This leads to
the description structure being indexed at position -1, producing the
segmentation fault. The assignment to pos_desc can be found in the same
function at line 271:
271: pos_desc = pStack->index;
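To illustrate how a 16-bit signed index can end up as a huge unsigned position, here is a minimal sketch in C. The helper name index_roundtrip is hypothetical and not part of Open-MPI; it only mimics storing an index in an int16_t field and reading it back into a uint32_t. It assumes a two's-complement platform, since narrowing an out-of-range value to int16_t is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not Open-MPI code): store a description index in a
 * 16-bit signed field, as dt_stack.index does, then read it back into a
 * uint32_t, as pos_desc does. */
uint32_t index_roundtrip(uint32_t real_index)
{
    /* The sketch only considers indices that fit in 16 bits of storage. */
    assert(real_index <= UINT16_MAX);

    /* Narrowing: values above 32767 wrap to negative numbers on
     * two's-complement platforms (implementation-defined in C). */
    int16_t stored = (int16_t)real_index;

    /* Widening: a negative value is converted modulo 2^32. */
    uint32_t pos_desc = (uint32_t)stored;
    return pos_desc;
}
```

With this helper, index_roundtrip(32767) returns 32767 unchanged, while index_roundtrip(37535) returns 0xffff929f, the value seen in pos_desc. An int16_t holding exactly -1 would instead convert to 0xffffffff, so the observed pos_desc corresponds to a stored value of -28001, which is what an index of 37535 becomes when truncated to 16 bits.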
The structure to which pStack points is of type dt_stack, defined in
ompi/datatype/convertor.h starting at line 65, where index is an int16_t
commented with “index in the element description”:
typedef struct dt_stack {
    int16_t   index;  /**< index in the element description */
    int16_t   type;   /**< the type used for the last pack/unpack (original or DT_BYTE) */
    size_t    count;  /**< number of times we still have to do it */
    ptrdiff_t disp;   /**< actual displacement depending on the count field */
} dt_stack_t;
We therefore conclude that MPI_Datatypes constructed with Open-MPI
(release 1.2.1a of April 10th, 2007) are limited to a maximum of 32’768
separate entries, since an int16_t index can only address positions 0 to
32’767.
Although changing the type of the index to int32_t solves the problem of the
segmentation fault, I would be happy if the author / maintainer of the code
could have a look at it and decide whether this is a viable fix. Having spent
a lot of time tracking the issue down in the Open-MPI code, I would be glad to
see it fixed in upcoming releases.
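For reference, the change we tried can be sketched as follows. This is only an illustration under assumptions: the names dt_stack_wide and read_pos_desc are hypothetical, the struct simply mirrors the dt_stack fields with index widened to int32_t, and it is not a patch against the actual Open-MPI sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: same fields as dt_stack, with index widened from int16_t
 * to int32_t. The name dt_stack_wide is hypothetical. */
typedef struct dt_stack_wide {
    int32_t   index;  /* index in the element description (was int16_t) */
    int16_t   type;   /* type used for the last pack/unpack */
    size_t    count;  /* number of times we still have to do it */
    ptrdiff_t disp;   /* actual displacement depending on the count field */
} dt_stack_wide_t;

/* Hypothetical helper: store an index in the widened field and read it
 * back into a uint32_t, the way pos_desc is assigned on line 271. */
uint32_t read_pos_desc(uint32_t real_index)
{
    /* The widened field still has a limit, just a much larger one. */
    assert(real_index <= INT32_MAX);

    dt_stack_wide_t frame = { .index = (int32_t)real_index };
    uint32_t pos_desc = (uint32_t)frame.index;
    return pos_desc;
}
```

With the wider field, read_pos_desc(37535) returns 37535 instead of wrapping. Whether other parts of the datatype engine also keep the index in 16-bit variables is something this sketch does not address.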
Thanks and regards,
Michael Gauckler