Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Possible bug with derived datatypes and openib BTL in trunk
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-05-06 22:39:22


I went over the provided trace file and tried to force the BTLs to
handle extremely weird (and uncomfortable) lengths, on both Mac OS X
and 64-bit Linux. Despite all my efforts I was unable to reproduce this
error, so I'm giving up until more information becomes available.

  George.

On Thu, Apr 17, 2014 at 11:28 AM, Rolf vandeVaart
<rvandevaart_at_[hidden]> wrote:
> I sent this information to George off the mailing list since the attachment was somewhat large.
> It is still strange that I seem to be the only one who sees this.
>
>>-----Original Message-----
>>From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George Bosilca
>>Sent: Wednesday, April 16, 2014 4:24 PM
>>To: Open MPI Developers
>>Subject: Re: [OMPI devel] Possible bug with derived datatypes and openib BTL in trunk
>>
>>Rolf,
>>
>>I didn't see these on my check run. Can you run the MPI_Isend_ator test with
>>mpi_ddt_pack_debug and mpi_ddt_unpack_debug set to 1? I would be
>>interested in the output you get on your machine.
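Enabling those two parameters on Rolf's run would look roughly like this (the parameter names are the ones given above; hostnames, the GID-prefix warning suppression, and the test binary are taken from the transcript below):

    mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 \
           --mca btl_openib_warn_default_gid_prefix 0 \
           --mca mpi_ddt_pack_debug 1 --mca mpi_ddt_unpack_debug 1 \
           MPI_Isend_ator_c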
>>
>>George.
>>
>>
>>On Apr 16, 2014, at 14:34, Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>>
>>> I have seen errors when running the Intel test suite with the openib BTL
>>> when transferring derived datatypes. I do not see the errors with the sm or
>>> tcp BTLs. The errors began after this checkin:
>>>
>>> https://svn.open-mpi.org/trac/ompi/changeset/31370
>>> Timestamp: 04/11/14 16:06:56 (5 days ago)
>>> Author: bosilca
>>> Message: Reshape all the packing/unpacking functions to use the same
>>> skeleton. Rewrite the generic_unpacking to take advantage of the same
>>> capabilities.
>>>
>>> Does anyone else see errors? Here is an example running with r31370:
>>>
>>> [rvandevaart_at_drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 1
>>> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>>> MPITEST error (0): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 0
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>>> MPITEST error (0): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 0
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 1
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>>> MPITEST error (0): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 0
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (0): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
>>> MPITEST_results: MPI_Isend_ator: All Isend TO Root 8 tests FAILED (of 3744)
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned a non-zero
>>> exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero status,
>>> thus causing the job to be terminated. The first process to do so was:
>>>
>>> Process name: [[12363,1],0]
>>> Exit code: 4
>>> --------------------------------------------------------------------------
>>> [rvandevaart_at_drossetti-ivy1 src]$
>>>
>>>
>>> Here is an error with the trunk, which is slightly different:
>>> [rvandevaart_at_drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
>>> [drossetti-ivy1.nvidia.com:22875] ../../../opal/datatype/opal_datatype_position.c:72
>>>         Pointer 0x1ad414c size 4 is outside [0x1ac1d20,0x1ad1d08] for base ptr 0x1ac1d20 count 273 and data
>>> [drossetti-ivy1.nvidia.com:22875] Datatype 0x1ac0220[] size 104 align 16 id 0 length 22 used 21
>>> true_lb 0 true_ub 232 (true_extent 232) lb 0 ub 240 (extent 240)
>>> nbElems 21 loops 0 flags 1C4 (commited )-c--lu-GD--[---][---]
>>>    contain lb ub OPAL_LB OPAL_UB OPAL_INT1 OPAL_INT2 OPAL_INT4 OPAL_INT8 OPAL_UINT1 OPAL_UINT2 OPAL_UINT4 OPAL_UINT8 OPAL_FLOAT4 OPAL_FLOAT8 OPAL_FLOAT16
>>> --C---P-D--[---][---] OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
>>> --C---P-D--[---][---] OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
>>> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
>>> --C---P-D--[---][---] OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
>>> --C---P-D--[---][---] OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
>>> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
>>> --C---P-D--[---][---] OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
>>> --C---P-D--[---][---] OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
>>> --C---P-D--[---][---] OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
>>> --C---P-D--[---][---] OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
>>> --C---P-D--[---][---] OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
>>> --C---P-D--[---][---] OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
>>> --C---P-D--[---][---] OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
>>> --C---P-D--[---][---] OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
>>> --C---P-D--[---][---] OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
>>> --C---P-D--[---][---] OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
>>> --C---P-D--[---][---] OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
>>> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
>>> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
>>> --C---P-D--[---][---] OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
>>> --C---P-D--[---][---] OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
>>> -------G---[---][---] OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
>>> Optimized description
>>> -cC---P-DB-[---][---] OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
>>> -cC---P-DB-[---][---] OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
>>> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
>>> -cC---P-DB-[---][---] OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
>>> -cC---P-DB-[---][---] OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
>>> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
>>> -cC---P-DB-[---][---] OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
>>> -cC---P-DB-[---][---] OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
>>> -cC---P-DB-[---][---] OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
>>> -cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
>>> -cC---P-DB-[---][---] OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
>>> -cC---P-DB-[---][---] OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
>>> -cC---P-DB-[---][---] OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
>>> -cC---P-DB-[---][---] OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
>>> -cC---P-DB-[---][---] OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
>>> -cC---P-DB-[---][---] OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
>>> -cC---P-DB-[---][---] OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
>>> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
>>> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
>>> -cC---P-DB-[---][---] OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
>>> -cC---P-DB-[---][---] OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
>>> -------G---[---][---] OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
>>>
>>> MPITEST error (1): libmpitest.c:1578 i=0, char value=-61, expected 0
>>> MPITEST error (1): libmpitest.c:1608 i=0, int32_t value=117, expected 0
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 4 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
>>> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
>>> MPITEST_results: MPI_Isend_ator: All Isend TO Root 1 tests FAILED (of 3744)
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned a non-zero
>>> exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero status,
>>> thus causing the job to be terminated. The first process to do so was:
>>>
>>> Process name: [[12296,1],1]
>>> Exit code: 1
>>> --------------------------------------------------------------------------
>>> [rvandevaart_at_drossetti-ivy1 src]$