Nadia,

I guess the revisions mentioned are from HG? If I知 not mistaken the change you mentioned corresponds to r29285. I知 not sure if they are related, as r29285 is about positioning a convertor, and this is only used in the case of multi-fragments messages. As this is not the case for your example, I don稚 think they are related.

I guess we should look at all the patches in the opal/datatype and ompi/datatype over the last 13 months (the starting point of the 1.6.3).

  George.


On Nov 25, 2013, at 14:10 , Nadia Derbey <Nadia.Derbey@bull.net> wrote:

George,

Thx for the detailed answer!
I did my tests on a v1.6.2 (changeset: 141b22222759).
After you told me it worked for you with earlier releases, I looked at the changesets applied since that time. I guess 28fd94d282a3
is the one that fixes my issue?

Regards,
Nadia

On 25/11/2013 13:36, George Bosilca wrote:
Nadia,

Which version of Open MPI are you using? I tried with the nightly r29751, the current 1.6 and the current 1.7 and I __always__ got the expected output.

There is a simple way to show what the datatype engine is doing. You can set the MCA parameters
mpi_ddt_unpack_debug and mpi_ddt_pack_debug to get more info. If you only want to see how the datatype looks after the MPI_Commit step you can call directly ompi_datatype_dump(ddt). This will show the internals of the datatype, converted in predefined types.

As an example I took the application you provided and build the following picture of what is send and what is received (original buffer, send datatype, packed buffer, recv datatype, resulting buffer).

<Mail Attachment.png>

Now using the ompi_datatype_dump, I see the recv and the send datatypes as:

-cC---P-DB-[---][---]     OPAL_UINT1 count 8 disp 0x0 (0) extent 1 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 8 disp 0x10 (16) extent 1 (size 8)
-cC猶-DB-[余[余           OPAL_INT4 count 4 disp 0x30 (48) extent 4 (size 16)
覧覧覧-G夕余[余          OPAL_END_LOOP pref 3 elements first elem displacement 0 size of data 32

-cC---P-DB-[---][---]     OPAL_UINT1 count 24 disp 0x10 (16) extent 1 (size 24)
-cC---P-DB-[---][---]     OPAL_UINT1 count 8 disp 0x30 (48) extent 1 (size 8)
覧宥---[---][---]         OPAL_END_LOOP prev 2 elements first elem displacement 16 size of data 32

This match perfectly to the datatype drawn by hand.

  George.



On Nov 25, 2013, at 11:40 , Nadia Derbey <Nadia.Derbey@bull.net> wrote:

Hi,

I'm currently working on a bug occuring at the client site with openmpi when calling MPI_Sendreceive() on datatypes built by the application.
I think I've found where the bug comes from (it is located in opal_generic_simple_pack_function() - file opal/datatype/opal_datatype_pack.c). But this code is so complicated that I'm more than unsure of my fix. What I can say is that it fixes things for me, but I need some advices from the datatypes specialists.

---------------

You will find in attachment the reproducer provided by the client, as well as the resulting output.
datatypes.c : reproducer
to run the binary: salloc --exclusive -p B510 -N 1 -n 1 mpirun ./datatypes
trc_ko: traces got without the patch applied
trc_ok: traces got with the patch applied.

---------------

The proposed patch is the following: (Note that the very first change in this patch was enough in my case, but I thought all the "source_base" settings should follow this model.)

-------------------------
opal_generic_simple_pack_function: add the datatype lb when progressing in the input buffer

diff -r cb23c2f07e1f opal/datatype/opal_datatype_pack.c
--- a/opal/datatype/opal_datatype_pack.c        Sun Nov 24 17:06:51 2013 +0000
+++ b/opal/datatype/opal_datatype_pack.c        Mon Nov 25 10:48:00 2013 +0100
@@ -301,7 +301,7 @@ opal_generic_simple_pack_function( opal_
                 PACK_PREDEFINED_DATATYPE( pConvertor, pElem, count_desc,
                                           source_base, destination, iov_len_local );
                 if( 0 == count_desc ) {  /* completed */
-                    source_base = pConvertor->pBaseBuf + pStack->disp;
+                    source_base = pConvertor->pBaseBuf + pStack->disp + pData->lb;
                     pos_desc++;  /* advance to the next data */
                     UPDATE_INTERNAL_COUNTERS( description, pos_desc, pElem, count_desc );
                     continue;
@@ -333,7 +333,7 @@ opal_generic_simple_pack_function( opal_
                         pStack->disp += description[pStack->index].loop.extent;
                     }
                 }
-                source_base = pConvertor->pBaseBuf + pStack->disp;
+                source_base = pConvertor->pBaseBuf + pStack->disp + pData->lb;
                 UPDATE_INTERNAL_COUNTERS( description, pos_desc, pElem, count_desc );
                 DO_DEBUG( opal_output( 0, "pack new_loop count %d stack_pos %d pos_desc %d disp %ld space %lu\n",
                                        (int)pStack->count, pConvertor->stack_pos, pos_desc, (long)pStack->disp, (unsigned long)iov_len_local ); );
@@ -354,7 +354,7 @@ opal_generic_simple_pack_function( opal_
                             pStack->disp + local_disp);
                 pos_desc++;
             update_loop_description:  /* update the current state */
-                source_base = pConvertor->pBaseBuf + pStack->disp;
+                source_base = pConvertor->pBaseBuf + pStack->disp + pData->lb;
                 UPDATE_INTERNAL_COUNTERS( description, pos_desc, pElem, count_desc );
                 DDT_DUMP_STACK( pConvertor->pStack, pConvertor->stack_pos, pElem, "advance loop" );
                 continue;
@@ -374,7 +374,7 @@ opal_generic_simple_pack_function( opal_
     }
     /* I complete an element, next step I should go to the next one */
     PUSH_STACK( pStack, pConvertor->stack_pos, pos_desc, OPAL_DATATYPE_INT8, count_desc,
-                source_base - pStack->disp - pConvertor->pBaseBuf );
+                source_base - pStack->disp - pConvertor->pBaseBuf - pData->lb );
     DO_DEBUG( opal_output( 0, "pack save stack stack_pos %d pos_desc %d count_desc %d disp %ld\n",
                            pConvertor->stack_pos, pStack->index, (int)pStack->count, (long)pStack->disp ); );
     return 0;

-------------------------------

Regards,
Nadia
-- 
Nadia Derbey
Bull, Architect of an Open World
http://www.bull.com
<datatypes.c><trc_ko.txt><trc_ok.txt>_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Nadia Derbey
Bull, Architect of an Open World
http://www.bull.com
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel