
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Possible bug with derived datatypes and openib BTL in trunk
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2014-04-17 11:28:21


I sent this information to George off the mailing list, since the attachment was somewhat large.
It is still strange that I seem to be the only one who sees this.

>-----Original Message-----
>From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George Bosilca
>Sent: Wednesday, April 16, 2014 4:24 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] Possible bug with derived datatypes and openib BTL in trunk
>
>Rolf,
>
>I didn't see these on my check run. Can you run the MPI_Isend_ator test with
>mpi_ddt_pack_debug and mpi_ddt_unpack_debug set to 1? I would be
>interested in the output you get on your machine.
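>
>For example, something along these lines (a sketch of the invocation only;
>the debug MCA parameters are passed with --mca just like the btl settings
>in your run below):
>
>  mpirun --mca mpi_ddt_pack_debug 1 --mca mpi_ddt_unpack_debug 1 \
>         --mca btl self,openib -np 2 MPI_Isend_ator_c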
>
>George.
>
>
>On Apr 16, 2014, at 14:34, Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>
>> I have seen errors when running the Intel test suite using the openib BTL
>> when transferring derived datatypes. I do not see the error with the sm or
>> tcp BTLs. The errors begin after this check-in:
>>
>> https://svn.open-mpi.org/trac/ompi/changeset/31370
>> Timestamp: 04/11/14 16:06:56 (5 days ago)
>> Author: bosilca
>> Message: Reshape all the packing/unpacking functions to use the same
>> skeleton. Rewrite the generic_unpacking to take advantage of the same
>> capabilities.
>>
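>> For context, a pack skeleton of the sort that commit message describes
>> might look roughly like this. This is a hypothetical sketch for
>> illustration only, not the actual OMPI code (elem_desc_t and pack_sketch
>> are made-up names):
>>
>>   #include <stddef.h>
>>   #include <string.h>
>>
>>   /* Hypothetical flattened element description (illustration only). */
>>   typedef struct {
>>       size_t disp;   /* displacement from the base pointer    */
>>       size_t size;   /* contiguous bytes at that displacement */
>>   } elem_desc_t;
>>
>>   /* Walk the description once per datatype instance, copying each
>>    * element from the non-contiguous user buffer into the packed
>>    * buffer; unpacking is the same loop with the memcpy reversed. */
>>   static void pack_sketch(char *packed, const char *base,
>>                           const elem_desc_t *desc, size_t n_elems,
>>                           size_t count, size_t extent)
>>   {
>>       for (size_t c = 0; c < count; c++) {
>>           for (size_t i = 0; i < n_elems; i++) {
>>               memcpy(packed, base + c * extent + desc[i].disp, desc[i].size);
>>               packed += desc[i].size;
>>           }
>>       }
>>   }
>>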
>> Does anyone else see errors? Here is an example running with r31370:
>>
>> [rvandevaart_at_drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>> MPITEST error (1): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>> MPITEST error (1): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 1
>> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
>> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
>> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
>> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>> MPITEST error (0): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 0
>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>> MPITEST error (0): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 0
>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>> MPITEST error (1): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 1
>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>> MPITEST error (0): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 0
>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>> MPITEST error (1): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -61
>> MPITEST error (0): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
>> MPITEST_results: MPI_Isend_ator: All Isend TO Root 8 tests FAILED (of 3744)
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned a non-zero
>> exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[12363,1],0]
>>   Exit code: 4
>> --------------------------------------------------------------------------
>> [rvandevaart_at_drossetti-ivy1 src]$
>>
>>
>> Here is an error with the trunk, which is slightly different:
>> [rvandevaart_at_drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
>> [drossetti-ivy1.nvidia.com:22875] ../../../opal/datatype/opal_datatype_position.c:72
>> Pointer 0x1ad414c size 4 is outside [0x1ac1d20,0x1ad1d08] for base ptr 0x1ac1d20 count 273 and data
>> [drossetti-ivy1.nvidia.com:22875] Datatype 0x1ac0220[] size 104 align 16 id 0 length 22 used 21
>> true_lb 0 true_ub 232 (true_extent 232) lb 0 ub 240 (extent 240)
>> nbElems 21 loops 0 flags 1C4 (commited )-c--lu-GD--[---][---]
>> contain lb ub OPAL_LB OPAL_UB OPAL_INT1 OPAL_INT2 OPAL_INT4 OPAL_INT8 OPAL_UINT1 OPAL_UINT2 OPAL_UINT4 OPAL_UINT8 OPAL_FLOAT4 OPAL_FLOAT8 OPAL_FLOAT16
>> --C---P-D--[---][---] OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
>> --C---P-D--[---][---] OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
>> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
>> --C---P-D--[---][---] OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
>> --C---P-D--[---][---] OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
>> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
>> --C---P-D--[---][---] OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
>> --C---P-D--[---][---] OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
>> --C---P-D--[---][---] OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
>> --C---P-D--[---][---] OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
>> --C---P-D--[---][---] OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
>> --C---P-D--[---][---] OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
>> --C---P-D--[---][---] OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
>> --C---P-D--[---][---] OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
>> --C---P-D--[---][---] OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
>> --C---P-D--[---][---] OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
>> --C---P-D--[---][---] OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
>> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
>> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
>> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
>> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
>> -------G---[---][---] OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
>> Optimized description
>> -cC---P-DB-[---][---] OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
>> -cC---P-DB-[---][---] OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
>> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
>> -cC---P-DB-[---][---] OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
>> -cC---P-DB-[---][---] OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
>> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
>> -cC---P-DB-[---][---] OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
>> -cC---P-DB-[---][---] OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
>> -cC---P-DB-[---][---] OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
>> -cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
>> -cC---P-DB-[---][---] OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
>> -cC---P-DB-[---][---] OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
>> -cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
>> -cC---P-DB-[---][---] OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
>> -cC---P-DB-[---][---] OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
>> -cC---P-DB-[---][---] OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
>> -cC---P-DB-[---][---] OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
>> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
>> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
>> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
>> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
>> -------G---[---][---] OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
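>>
>> (Those bounds are consistent with the dump: with count 273, extent 240, and
>> true_ub 232, the valid region is (273 - 1) * 240 + 232 = 65512 = 0xffe8
>> bytes, i.e. [0x1ac1d20, 0x1ad1d08]; the flagged pointer 0x1ad414c sits
>> 0x1242c = 74796 bytes past the base, well outside that region.)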
>>
>> MPITEST error (1): libmpitest.c:1578 i=0, char value=-61, expected 0
>> MPITEST error (1): libmpitest.c:1608 i=0, int32_t value=117, expected 0
>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>> MPITEST error (1): 4 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
>> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
>> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
>> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
>> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
>> MPITEST_results: MPI_Isend_ator: All Isend TO Root 1 tests FAILED (of 3744)
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned a non-zero
>> exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated. The first process to do so was:
>>
>>   Process name: [[12296,1],1]
>>   Exit code: 1
>> --------------------------------------------------------------------------
>> [rvandevaart_at_drossetti-ivy1 src]$
>>