Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fwd: Purify found bugs inside open-mpi library
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-29 16:06:48


Actually, I think your program is erroneous -- it looks like you're
using the number of bytes for the sizes[] array when it really should
be using the number of elements. Specifically, it should be:

     sizes[0] = (int) sizeof(tstruct.one);
     sizes[1] = 1;
     sizes[2] = 1;
     sizes[3] = 1;

Since MPI knows the sizes of datatypes, you specify counts of
datatypes -- not numbers of bytes.

That seemed to make your program work for me; double check and ensure
that it works for you.
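
For reference, here's a minimal sketch of how the whole type
construction could look with element counts. The struct layout and
field names below are only my reconstruction from the sizes/offsets
that your program prints, and I'm using the MPI-2 names
(MPI_Get_address / MPI_Type_create_struct); adjust both to match what
your test.c actually does:

     #include <mpi.h>

     /* Guessed layout: matches the sizes 3/4/8/2 and offsets 0/4/8/16
        printed in the output; the real definition lives in test.c. */
     struct tstruct_t {
         char   one[3];
         long   two;
         double three;
         short  four;
     };

     /* Build and commit the derived datatype using element counts. */
     static void build_tstruct_type(MPI_Datatype *newtype)
     {
         struct tstruct_t tstruct;
         int          sizes[4];      /* block lengths, in elements */
         MPI_Aint     offsets[4], base;
         MPI_Datatype types[4] = { MPI_CHAR, MPI_LONG,
                                   MPI_DOUBLE, MPI_SHORT };
         int          i;

         sizes[0] = (int) sizeof(tstruct.one);   /* 3 chars  */
         sizes[1] = 1;                           /* 1 long   */
         sizes[2] = 1;                           /* 1 double */
         sizes[3] = 1;                           /* 1 short  */

         /* Byte displacements relative to the start of the struct. */
         MPI_Get_address(&tstruct,       &base);
         MPI_Get_address(&tstruct.one,   &offsets[0]);
         MPI_Get_address(&tstruct.two,   &offsets[1]);
         MPI_Get_address(&tstruct.three, &offsets[2]);
         MPI_Get_address(&tstruct.four,  &offsets[3]);
         for (i = 0; i < 4; ++i)
             offsets[i] -= base;

         MPI_Type_create_struct(4, sizes, offsets, types, newtype);
         MPI_Type_commit(newtype);
     }

Taking the displacements with MPI_Get_address keeps the code correct
even if the compiler inserts padding; MPI only needs the byte offsets
plus element counts, since the element sizes come from the types[]
entries themselves.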

On Apr 29, 2009, at 1:21 PM, Brian Blank wrote:

> To Whom This May Concern:
>
> I originally sent this to the users list, but I realize now that it
> might be more appropriate for the developers' list, as it deals with
> issues internal to the Open MPI library (sorry for the dual
> distribution). Please start with the second email first.
>
> Thanks,
> Brian Blank
>
> ---------- Forwarded message ----------
> From: Brian Blank <brianblank_at_[hidden]>
> Date: Wed, Apr 29, 2009 at 1:09 PM
> Subject: Re: Purify found bugs inside open-mpi library
> To: users_at_[hidden]
>
>
> To Whom This May Concern:
>
> I've been trying to dig a little deeper into this problem and found
> some additional information.
>
> First, the stack traces for the ABR and ABW errors were different:
> the ABR problem occurred in datatype_pack.h while the ABW problem
> occurred in datatype_unpack.h. Still, the underlying problem appears
> to be the same; both errors occur during a call to MEMCPY_CSUM().
>
> I also found that two different variables play into this bug:
> _copy_blength and _copy_count. At the top of the function, both of
> these variables are set to 2 bytes for MPI_SHORT, 4 bytes for
> MPI_LONG, and 8 bytes for MPI_DOUBLE. Then, the two variables are
> multiplied together to get the size of the memcpy().
> Unfortunately, the correct size is either of these variables before
> they are multiplied. There also appears to be an optimization where
> adjacent fields are sometimes converted into MPI_BYTE, and that size
> is also incorrectly based on the multiplied values.
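>
> For example, for the MPI_DOUBLE field (which, per the sizes line in
> the output below, was described with a block length of 8), if I'm
> reading the code right the numbers work out like this:
>
>      orig_copy_blength = 8         (size of one MPI_DOUBLE)
>      _copy_count       = 8         (the block length from the datatype)
>      _copy_blength     = 8 * 8 = 64 bytes handed to memcpy(), not 8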
>
> I wrote a small test program to illustrate the problem and attached
> it to this email. First, I configured openmpi 1.3.2 with the
> following options:
>
> ./configure --prefix=/myworkingdirectory/openmpi-1.3.2.local --
> disable-mpi-f77 --disable-mpi-f90 --enable-debug --enable-mem-debug
> --enable-mem-profile
>
> I then modified datatype_pack.h & datatype_unpack.h, located in the
> openmpi-1.3.2/ompi/datatype directory, in order to produce additional
> debugging output. The new versions are attached to this email.
>
> Then, I executed "make all install"
>
> Then, I wrote the attached test.c program. You can find its output
> below; the problems appear in red.
>
> 0: sizes '3' '4' '8' '2'
> 0: offsets '0' '4' '8' '16'
> 0: addresses '134510640' '134510644' '134510648' '134510656'
> 0: name='MPI_CHAR' _copy_blength='3' orig_copy_blength='1' _copy_count='3' _source='134510640'
> 0: name='MPI_LONG' _copy_blength='16' orig_copy_blength='4' _copy_count='4' _source='134510644'
> 0: name='MPI_DOUBLE' _copy_blength='64' orig_copy_blength='8' _copy_count='8' _source='134510648'
> 0: name='MPI_SHORT' _copy_blength='4' orig_copy_blength='2' _copy_count='2' _source='134510656'
> 0: one='22' two='222' three='33.300000' four='44'
> 1: sizes '3' '4' '8' '2'
> 1: offsets '0' '4' '8' '16'
> 1: addresses '134510640' '134510644' '134510648' '134510656'
> 1: name='MPI_CHAR' _copy_blength='3' orig_copy_blength='1' _copy_count='3' _destination='134510640'
> 1: name='MPI_LONG' _copy_blength='16' orig_copy_blength='4' _copy_count='4' _destination='134510644'
> 1: name='MPI_DOUBLE' _copy_blength='64' orig_copy_blength='8' _copy_count='8' _destination='134510648'
> 1: name='MPI_SHORT' _copy_blength='4' orig_copy_blength='2' _copy_count='2' _destination='134510656'
> 1: one='22' two='222' three='33.300000' four='44'
>
> You can see from the output that the MPI_Send & MPI_Recv functions
> are reading or writing too much data from my structure, causing an
> overflow condition to occur. I believe this is causing my
> application to crash.
>
> Any help on this problem would be appreciated. If there is anything
> else that you need from me, just let me know.
>
> Thanks,
> Brian
>
>
>
>
> On Tue, Apr 28, 2009 at 9:58 PM, Brian Blank <brianblank_at_[hidden]>
> wrote:
> To Whom This May Concern:
>
> I am having problems with an OpenMPI application I am writing on the
> Solaris/Intel AMD64 platform. I am using OpenMPI 1.3.2 which I
> compiled myself using the Solaris C/C++ compiler.
>
> My application was crashing (signal 11) inside a call to malloc()
> made by my code. I thought a memory overflow might be causing this,
> so I fired up Purify. Purify found several problems inside the
> OpenMPI library. I think one of the errors is serious and might be
> causing the problem I was looking for.
>
> The serious error is an Array Bounds Write (ABW), which occurred 824
> times across 312 calls to MPI_Recv(). This error means that
> something inside the OpenMPI library is writing to memory where it
> shouldn't be. Here are the stack traces at the time of this error:
>
> Stack Trace 1 (Occurred 596 times)
>
> memcpy rtlib.o
> unpack_predefined_data [datatype_unpack.h:41]
> MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
> ompi_generic_simple_unpack [datatype_unpack.c:419]
> ompi_convertor_unpack [convertor.c:314]
> mca_pml_ob1_recv_frag_callback_match [pml_ob1_recvfrag.c:218]
> mca_btl_sm_component_progress [btl_sm_component.c:427]
> opal_progress [opal_progress.c:207]
> opal_condition_wait [condition.h:99]
> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of 664 bytes.>
>
> Stack Trace 2 (Occurred 228 times)
>
> memcpy rtlib.o
> unpack_predefined_data [datatype_unpack.h:41]
> MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
> ompi_generic_simple_unpack [datatype_unpack.c:419]
> ompi_convertor_unpack [convertor.c:314]
> mca_pml_ob1_recv_request_progress_match [pml_ob1_recvreq.c:624]
> mca_pml_ob1_recv_req_start [pml_ob1_recvreq.c:1008]
> mca_pml_ob1_recv [pml_ob1_irecv.c:103]
> MPI_Recv [precv.c:75]
> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of 664 bytes.>
>
>
> I'm not that familiar with the inner workings of the openmpi
> library, but I tried to debug it anyway. I noticed that it was
> copying a lot of extra bytes for MPI_LONG and MPI_DOUBLE types. On
> my system, MPI_LONG should have been 4 bytes, but 16 bytes were
> being copied. Also, MPI_DOUBLE should have been 8 bytes, but 64
> bytes were being copied. It seems the _copy_blength variable was
> being set too high, but I'm not sure why. The above error also shows
> a 64-byte write, where my debugging shows a 64-byte copy for every
> MPI_DOUBLE, which I feel should have been 8 bytes. Therefore, I
> really believe _copy_blength is being set too high.
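>
> (The Purify numbers are consistent with this: the write starts 616
> bytes into a 664-byte block, so a 64-byte copy reaches 616 + 64 =
> 680 bytes, 16 bytes past the end of the allocation, which is exactly
> the "16 bytes at 0x821f768 illegal" that Purify flagged.)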
>
>
> Interestingly enough, the call to MPI_Isend() was generating an ABR
> (Array Bounds Read) error on the exact same line of code. An ABR
> error can sometimes be fatal if the address being read is not legal,
> but the ABW error is usually much worse because it is definitely
> writing into memory that is probably used for something else. I'm
> sure that if we fix the ABW error, the ABR error will fix itself
> too, as it's the same line of code.
>
> Purify also found 14 UMR (Uninitialized memory read) errors inside
> the OpenMPI library. Sometimes this can be really bad if there are
> any decisions being made as a result of reading this memory
> location. But for now, let's solve the serious error I reported
> above first, then I will send you the UMR errors next.
>
> Any help you can provide would be greatly appreciated.
>
> Thanks,
> Brian
>
>
>
> <datatype_pack.h> <datatype_unpack.h> <test.c>

-- 
Jeff Squyres
Cisco Systems