Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Possible bug with derived datatypes and openib BTL in trunk
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-04-16 16:23:41


Rolf,

I didn’t see these errors on my check run. Can you run the MPI_Isend_ator test with mpi_ddt_pack_debug and mpi_ddt_unpack_debug set to 1? I would be interested in the output you get on your machine.
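
For reference, a minimal sketch of such a run, reusing the command line from the report quoted below and adding the two parameters as --mca key/value pairs. The host names and test binary are copied from that report and will differ on other clusters. It is an assumption here that these debug parameters only produce output in a build configured with debugging enabled.

```shell
#!/bin/sh
# Sketch only: host names and test binary are taken from the quoted report.
HOSTS="drossetti-ivy0,drossetti-ivy1"
TEST_BIN="MPI_Isend_ator_c"

# The two datatype debug parameters named above, both set to 1.
DDT_DEBUG="--mca mpi_ddt_pack_debug 1 --mca mpi_ddt_unpack_debug 1"

CMD="mpirun --mca btl self,openib $DDT_DEBUG -np 2 -host $HOSTS $TEST_BIN"

# Always show the command; run it only if mpirun and the test binary exist here.
echo "$CMD"
if command -v mpirun >/dev/null 2>&1 && [ -e "$TEST_BIN" ]; then
    # Capture stdout and stderr so the debug output can be attached to a reply.
    $CMD 2>&1 | tee ddt_debug.log
fi
```

The log file name ddt_debug.log is arbitrary.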

George.

On Apr 16, 2014, at 14:34 , Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:

> I have seen errors when running the Intel test suite with the openib BTL when transferring derived datatypes. I do not see the errors with the sm or tcp BTLs. The errors begin after this check-in.
>
> https://svn.open-mpi.org/trac/ompi/changeset/31370
> Timestamp: 04/11/14 16:06:56 (5 days ago)
> Author: bosilca
> Message: Reshape all the packing/unpacking functions to use the same skeleton. Rewrite the
> generic_unpacking to take advantage of the same capabilities.
>
> Does anyone else see errors? Here is an example running with r31370:
>
> [rvandevaart_at_drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
> MPITEST error (1): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
> MPITEST error (1): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 1
> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
> MPITEST error (0): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 0
> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
> MPITEST error (0): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 0
> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
> MPITEST error (1): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 1
> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
> MPITEST error (0): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 0
> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
> MPITEST error (1): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -61
> MPITEST error (0): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
> MPITEST_results: MPI_Isend_ator: All Isend TO Root 8 tests FAILED (of 3744)
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[12363,1],0]
> Exit code: 4
> --------------------------------------------------------------------------
> [rvandevaart_at_drossetti-ivy1 src]$
>
>
> Here is a slightly different error with the trunk:
> [rvandevaart_at_drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
> [drossetti-ivy1.nvidia.com:22875] ../../../opal/datatype/opal_datatype_position.c:72
> Pointer 0x1ad414c size 4 is outside [0x1ac1d20,0x1ad1d08] for
> base ptr 0x1ac1d20 count 273 and data
> [drossetti-ivy1.nvidia.com:22875] Datatype 0x1ac0220[] size 104 align 16 id 0 length 22 used 21
> true_lb 0 true_ub 232 (true_extent 232) lb 0 ub 240 (extent 240)
> nbElems 21 loops 0 flags 1C4 (commited )-c--lu-GD--[---][---]
> contain lb ub OPAL_LB OPAL_UB OPAL_INT1 OPAL_INT2 OPAL_INT4 OPAL_INT8 OPAL_UINT1 OPAL_UINT2 OPAL_UINT4 OPAL_UINT8 OPAL_FLOAT4 OPAL_FLOAT8 OPAL_FLOAT16
> --C---P-D--[---][---] OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
> --C---P-D--[---][---] OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
> --C---P-D--[---][---] OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
> --C---P-D--[---][---] OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
> --C---P-D--[---][---] OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
> --C---P-D--[---][---] OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
> --C---P-D--[---][---] OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
> --C---P-D--[---][---] OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
> --C---P-D--[---][---] OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
> --C---P-D--[---][---] OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
> --C---P-D--[---][---] OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
> --C---P-D--[---][---] OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
> --C---P-D--[---][---] OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
> --C---P-D--[---][---] OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
> --C---P-D--[---][---] OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
> -------G---[---][---] OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
> Optimized description
> -cC---P-DB-[---][---] OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
> -cC---P-DB-[---][---] OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
> -cC---P-DB-[---][---] OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
> -cC---P-DB-[---][---] OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
> -cC---P-DB-[---][---] OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
> -cC---P-DB-[---][---] OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
> -cC---P-DB-[---][---] OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
> -cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
> -cC---P-DB-[---][---] OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
> -cC---P-DB-[---][---] OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
> -cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
> -cC---P-DB-[---][---] OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
> -cC---P-DB-[---][---] OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
> -cC---P-DB-[---][---] OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
> -cC---P-DB-[---][---] OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
> -------G---[---][---] OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
>
> MPITEST error (1): libmpitest.c:1578 i=0, char value=-61, expected 0
> MPITEST error (1): libmpitest.c:1608 i=0, int32_t value=117, expected 0
> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
> MPITEST error (1): 4 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
> MPITEST_results: MPI_Isend_ator: All Isend TO Root 1 tests FAILED (of 3744)
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[12296,1],1]
> Exit code: 1
> --------------------------------------------------------------------------
> [rvandevaart_at_drossetti-ivy1 src]$
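
The bounds report from opal_datatype_position.c in the trunk run can be checked by hand. The sketch below assumes that the valid range ends at (count - 1) * extent + true_ub bytes past the base pointer. That formula is an inference from the numbers printed in the report, not something taken from the Open MPI source.

```shell
#!/bin/bash
# Values copied from the "Pointer ... is outside [...]" report and datatype dump.
base=$((0x1ac1d20))    # base ptr, also the lower bound of the valid range
upper=$((0x1ad1d08))   # upper bound of the valid range
ptr=$((0x1ad414c))     # pointer that was flagged
count=273
extent=240
true_ub=232

# Assumed formula for the last valid offset from the base pointer.
span=$(( (count - 1) * extent + true_ub ))

echo "reported span : $((upper - base)) bytes"
echo "computed span : $span bytes"
echo "pointer offset: $((ptr - base)) bytes"
echo "overshoot     : $((ptr - upper)) bytes"
```

Under that assumption the computed span matches the reported one (65512 bytes), while the flagged pointer sits 74796 bytes past the base, 9284 bytes beyond the end of the buffer.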
>
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/04/14553.php