The problem is correctly identified and solved. I have already pushed the patch to the trunk. I will create the CMR for both 1.5 and 1.4.
Kudos to the Fujitsu team, that was a tricky one to find. Thanks for your contribution!
On Jan 12, 2012, at 10:39, Barrett, Brian W wrote:
> George -
> This looks right to me, but the patches are in the datatype engine, so can
> you weigh in?
> On 1/11/12 10:04 PM, "Kawashima" <t-kawashima_at_[hidden]> wrote:
>> Hi Open MPI developers,
>> We, Fujitsu, noticed that one-sided communication with some sort of
>> derived datatype fails on sparc64 machines.
>> In one-sided communication of Open MPI, the structure of datatype of
>> target buffer is:
>> (1) encoded in origin process, and
>> (2) transferred to target process, and
>> (3) decoded in target process.
>> The encoding and decoding are implemented in ompi_datatype_args.c,
>> which takes alignment into account, but not sufficiently, it seems.
>> In the encoding stage, the __ompi_datatype_pack_description function
>> handles alignment, as described in its comment.
>> For derived datatypes of one level, that code is OK.
>> But for derived datatypes of multiple levels (i.e. derived datatypes
>> created from derived datatypes), __ompi_datatype_pack_description
>> is called recursively with an unaligned packed_buffer if
>> args->ci is odd.
>> On the other hand, in the decoding stage, the
>> __ompi_datatype_create_from_packed_description function expects
>> padding for an odd args->ci. For derived datatypes, packed_buffer is
>> always aligned to 64 bits, even if this function is called recursively.
>> This incompatibility causes a segmentation fault or a similar failure
>> in the ompi_ddt_create_xxxx function called by __ompi_ddt_create_from_args.
>> It seems the decoding stage and the buffer size calculation (ALLOC_ARGS
>> macro) already handle alignment sufficiently. So I think fixing the
>> encoding stage is sufficient for this bug.
>> I've attached patches for trunk and the v1.4 branch respectively.
>> A program (needs sparc64) to reproduce this problem is also attached.
>> This bug appears if all of the following conditions are met.
>> - sparc64 or another alignment-sensitive architecture
>> (configure generates OPAL_ALIGN_WORD_SIZE_INTEGERS == 1)
>> - use a derived datatype for the target buffer of one-sided communication
>> - create that derived datatype by multiple levels of MPI_Type_create_xxxx
>> - use one of the following functions at the second level or later
>> (args->ci is odd)
>> * MPI_Type_create_hvector
>> * MPI_Type_create_struct
>> * MPI_Type_create_hindexed
>> * MPI_Type_create_indexed_block
>> Takahiro Kawashima,
>> MPI development team,
>> devel mailing list
> Brian W. Barrett
> Dept. 1423: Scalable System Software
> Sandia National Laboratories