Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] bug in opal_generic_simple_pack_function()
From: Nadia Derbey (Nadia.Derbey_at_[hidden])
Date: 2013-11-25 08:10:59


George,

Thx for the detailed answer!
I did my tests on a v1.6.2 (changeset: 141b22222759).
After you told me it worked for you with earlier releases, I looked at
the changesets applied since that time. I guess 28fd94d282a3is the one
that fixes my issue?

Regards,
Nadia

On 25/11/2013 13:36, George Bosilca wrote:
> Nadia,
>
> Which version of Open MPI are you using? I tried with the nightly
> r29751, the current 1.6 and the current 1.7 and I __always__ got the
> expected output.
>
> There is a simple way to show what the datatype engine is doing. You
> can set the MCA parameters
> mpi_ddt_unpack_debug and mpi_ddt_pack_debug to get more info. If you
> only want to see how the datatype looks after the MPI_Commit step you
> can call directly ompi_datatype_dump(ddt). This will show the
> internals of the datatype, converted in predefined types.
>
> As an example I took the application you provided and build the
> following picture of what is send and what is received (original
> buffer, send datatype, packed buffer, recv datatype, resulting buffer).
>
>
> Now using the ompi_datatype_dump, I see the recv and the send
> datatypes as:
>
> -cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x0 (0) extent 1
> (size 8)
> -cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x10 (16) extent 1
> (size 8)
> -cC---P-DB-[---][---] OPAL_INT4 count 4 disp 0x30 (48)
> extent 4 (size 16)
> ----------------------G---[---][---] OPAL_END_LOOP pref 3
> elements first elem displacement 0 size of data 32
>
> -cC---P-DB-[---][---] OPAL_UINT1 count 24 disp 0x10 (16) extent 1
> (size 24)
> -cC---P-DB-[---][---] OPAL_UINT1 count 8 disp 0x30 (48) extent 1
> (size 8)
> ---------G---[---][---] OPAL_END_LOOP prev 2 elements first
> elem displacement 16 size of data 32
>
> This match perfectly to the datatype drawn by hand.
>
> George.
>
>
>
> On Nov 25, 2013, at 11:40 , Nadia Derbey <Nadia.Derbey_at_[hidden]
> <mailto:Nadia.Derbey_at_[hidden]>> wrote:
>
>> Hi,
>>
>> I'm currently working on a bug occuring at the client site with
>> openmpi when calling MPI_Sendreceive() on datatypes built by the
>> application.
>> I think I've found where the bug comes from (it is located in
>> opal_generic_simple_pack_function() - file
>> opal/datatype/opal_datatype_pack.c). But this code is so complicated
>> that I'm more than unsure of my fix. What I can say is that it fixes
>> things for me, but I need some advices from the datatypes specialists.
>>
>> ---------------
>>
>> You will find in attachment the reproducer provided by the client, as
>> well as the resulting output.
>> datatypes.c : reproducer
>> to run the binary: salloc --exclusive -p B510 -N 1 -n 1 mpirun
>> ./datatypes
>> trc_ko: traces got without the patch applied
>> trc_ok: traces got with the patch applied.
>>
>> ---------------
>>
>> The proposed patch is the following: (Note that the very first change
>> in this patch was enough in my case, but I thought all the
>> "source_base" settings should follow this model.)
>>
>> -------------------------
>> opal_generic_simple_pack_function: add the datatype lb when
>> progressing in the input buffer
>>
>> diff -r cb23c2f07e1f opal/datatype/opal_datatype_pack.c
>> --- a/opal/datatype/opal_datatype_pack.c Sun Nov 24 17:06:51
>> 2013 +0000
>> +++ b/opal/datatype/opal_datatype_pack.c Mon Nov 25 10:48:00
>> 2013 +0100
>> @@ -301,7 +301,7 @@ opal_generic_simple_pack_function( opal_
>> PACK_PREDEFINED_DATATYPE( pConvertor, pElem, count_desc,
>> source_base, destination,
>> iov_len_local );
>> if( 0 == count_desc ) { /* completed */
>> - source_base = pConvertor->pBaseBuf + pStack->disp;
>> + source_base = pConvertor->pBaseBuf +
>> pStack->disp + pData->lb;
>> pos_desc++; /* advance to the next data */
>> UPDATE_INTERNAL_COUNTERS( description, pos_desc,
>> pElem, count_desc );
>> continue;
>> @@ -333,7 +333,7 @@ opal_generic_simple_pack_function( opal_
>> pStack->disp +=
>> description[pStack->index].loop.extent;
>> }
>> }
>> - source_base = pConvertor->pBaseBuf + pStack->disp;
>> + source_base = pConvertor->pBaseBuf + pStack->disp +
>> pData->lb;
>> UPDATE_INTERNAL_COUNTERS( description, pos_desc,
>> pElem, count_desc );
>> DO_DEBUG( opal_output( 0, "pack new_loop count %d
>> stack_pos %d pos_desc %d disp %ld space %lu\n",
>> (int)pStack->count, pConvertor->stack_pos, pos_desc,
>> (long)pStack->disp, (unsigned long)iov_len_local ); );
>> @@ -354,7 +354,7 @@ opal_generic_simple_pack_function( opal_
>> pStack->disp + local_disp);
>> pos_desc++;
>> update_loop_description: /* update the current state */
>> - source_base = pConvertor->pBaseBuf + pStack->disp;
>> + source_base = pConvertor->pBaseBuf + pStack->disp +
>> pData->lb;
>> UPDATE_INTERNAL_COUNTERS( description, pos_desc,
>> pElem, count_desc );
>> DDT_DUMP_STACK( pConvertor->pStack,
>> pConvertor->stack_pos, pElem, "advance loop" );
>> continue;
>> @@ -374,7 +374,7 @@ opal_generic_simple_pack_function( opal_
>> }
>> /* I complete an element, next step I should go to the next one */
>> PUSH_STACK( pStack, pConvertor->stack_pos, pos_desc,
>> OPAL_DATATYPE_INT8, count_desc,
>> - source_base - pStack->disp - pConvertor->pBaseBuf );
>> + source_base - pStack->disp - pConvertor->pBaseBuf -
>> pData->lb );
>> DO_DEBUG( opal_output( 0, "pack save stack stack_pos %d pos_desc
>> %d count_desc %d disp %ld\n",
>> pConvertor->stack_pos, pStack->index,
>> (int)pStack->count, (long)pStack->disp ); );
>> return 0;
>>
>> -------------------------------
>>
>> Regards,
>> Nadia
>> --
>> Nadia Derbey
>> Bull, Architect of an Open World
>> http://www.bull.com
>> <datatypes.c><trc_ko.txt><trc_ok.txt>_______________________________________________
>> devel mailing list
>> devel_at_[hidden] <mailto:devel_at_[hidden]>
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Nadia Derbey
Bull, Architect of an Open World
http://www.bull.com



picture