
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Fwd: Purify found bugs inside open-mpi library
From: Brian Blank (brianblank_at_[hidden])
Date: 2009-04-29 17:03:44


Hi Jeff,

That definitely worked for me. Thanks so much for your help.

Purify did find some other UMR (uninitialized memory read) errors,
though they don't seem to be negatively impacting my application right
now. Nonetheless, I'll post them later today in case anyone is
interested in them.

Just to give you a sample of what it sees now, one of the UMR errors
seems a little odd ... paffinity_solaris_module.c line 180:
if (P_ONLINE == pinfo.pi_state || P_NOINTR == pinfo.pi_state) {
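
If I'm reading that report right (just a guess -- I haven't dug
through the actual paffinity source), pinfo is a processor_info_t
that presumably gets filled in by processor_info(), and if that call
can fail without its return value being checked, reading pi_state
afterwards would be exactly this kind of UMR. A minimal sketch of the
pattern I mean, with hypothetical names:

#include <sys/processor.h>  /* processor_info(), processor_info_t, P_ONLINE, P_NOINTR */

/* Hypothetical illustration only, not the actual paffinity code. */
static int processor_is_usable(processorid_t id)
{
    processor_info_t pinfo;   /* uninitialized until processor_info() succeeds */

    if (processor_info(id, &pinfo) != 0) {
        return 0;             /* pinfo was never written; don't read pi_state */
    }
    return (P_ONLINE == pinfo.pi_state || P_NOINTR == pinfo.pi_state);
}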

I'll post the rest of the UMR errors later today.

Thanks again,
Brian

On Apr 29, 2009, at 4:06 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> Actually, I think your program is erroneous -- it looks like you're
> using number of bytes for the sizes[] array when it really should be
> using number of elements. Specifically, it should be:
>
> sizes[0] = (int) sizeof(tstruct.one);
> sizes[1] = 1;
> sizes[2] = 1;
> sizes[3] = 1;
>
> Since MPI knows the sizes of datatypes, you specify counts of
> datatypes -- not numbers of bytes.
>
> That seemed to make your program work for me; double check and
> ensure that it works for you.
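
[For anyone hitting the same thing: here is roughly what the corrected
type construction amounts to on my end -- a sketch consistent with the
debug output quoted further down, not the literal attached test.c.]

#include <mpi.h>
#include <stddef.h>

/* Struct layout inferred from the sizes/offsets in the output below:
   char[3] at offset 0, long at 4, double at 8, short at 16. */
struct tstruct {
    char   one[3];
    long   two;
    double three;
    short  four;
};

static MPI_Datatype build_tstruct_type(void)
{
    /* Element counts, NOT byte sizes.  (For the char field the two
       happen to coincide, which is why only that field looked right
       before the fix.) */
    int          blocklens[4] = { 3, 1, 1, 1 };
    MPI_Aint     disps[4]     = { offsetof(struct tstruct, one),
                                  offsetof(struct tstruct, two),
                                  offsetof(struct tstruct, three),
                                  offsetof(struct tstruct, four) };
    MPI_Datatype types[4]     = { MPI_CHAR, MPI_LONG, MPI_DOUBLE, MPI_SHORT };
    MPI_Datatype newtype;

    MPI_Type_create_struct(4, blocklens, disps, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;   /* send/receive one element of this type per struct */
}

With the counts fixed, a one-element send of this type moves
3 + 4 + 8 + 2 payload bytes instead of the 3 + 16 + 64 + 4 shown in
the output further down.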
>
>
> On Apr 29, 2009, at 1:21 PM, Brian Blank wrote:
>
>> To Whom This May Concern:
>>
>> I originally sent this to the users list, but I realize now that
>> it might be more appropriate for the developer's list, as it
>> deals with issues internal to the openmpi library (sorry for the
>> dual distribution). Please read the second email first.
>>
>> Thanks,
>> Brian Blank
>>
>> ---------- Forwarded message ----------
>> From: Brian Blank <brianblank_at_[hidden]>
>> Date: Wed, Apr 29, 2009 at 1:09 PM
>> Subject: Re: Purify found bugs inside open-mpi library
>> To: users_at_[hidden]
>>
>>
>> To Whom This May Concern:
>>
>> I've been trying to dig a little deeper into this problem and found
>> some additional information.
>>
>> First, the stack traces for the ABR and ABW were different. The ABR
>> problem occurred in datatype_pack.h while the ABW problem occurred
>> in datatype_unpack.h, but the underlying problem appears to be the
>> same. Both errors occur during a call to MEMCPY_CSUM().
>>
>> I also found there are two different variables playing into this
>> bug: _copy_blength and _copy_count. At the top of the function,
>> both of these variables are set to 2 bytes for MPI_SHORT, 4 bytes
>> for MPI_LONG, and 8 bytes for MPI_DOUBLE. Then, these variables are
>> multiplied together to get the size of the memcpy(). Unfortunately,
>> the correct size is either of these variables on its own, before
>> they are squared. There also sometimes appears to be an optimization
>> where, if two elements are adjacent, they are converted into an
>> MPI_BYTE copy whose size also incorrectly reflects these squared
>> values.
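
[Annotating with hindsight, after Jeff's reply above: the "squaring"
falls out of my own sizes[] mistake -- byte sizes passed as element
counts -- rather than anything inside the library. For the MPI_DOUBLE
field, the 8-byte type size gets multiplied by the count of 8 I passed
in (really sizeof(double)), so 8 x 8 = 64 bytes are copied where 8
were intended. With the correct count of 1, the copy is 8 bytes and
the "squaring" disappears.]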
>>
>> I wrote a small test program to illustrate the problem and attached
>> it to this email. First, I configured openmpi 1.3.2 with the
>> following options:
>>
>> ./configure --prefix=/myworkingdirectory/openmpi-1.3.2.local --
>> disable-mpi-f77 --disable-mpi-f90 --enable-debug --enable-mem-debug
>> --enable-mem-profile
>>
>> I then modified datatype_pack.h and datatype_unpack.h, located in
>> the openmpi-1.3.2/ompi/datatype directory, in order to produce
>> additional debugging output. The new versions are attached to this
>> email.
>>
>> Then, I executed "make all install"
>>
>> Then, I wrote the attached test.c program. You can find the output
>> below; the problem values are the inflated _copy_blength numbers
>> (shown in red in my original message).
>>
>> 0: sizes '3' '4' '8' '2'
>> 0: offsets '0' '4' '8' '16'
>> 0: addresses '134510640' '134510644' '134510648' '134510656'
>> 0: name='MPI_CHAR' _copy_blength='3' orig_copy_blength='1'
>> _copy_count='3' _source='134510640'
>> 0: name='MPI_LONG' _copy_blength='16' orig_copy_blength='4'
>> _copy_count='4' _source='134510644'
>> 0: name='MPI_DOUBLE' _copy_blength='64' orig_copy_blength='8'
>> _copy_count='8' _source='134510648'
>> 0: name='MPI_SHORT' _copy_blength='4' orig_copy_blength='2'
>> _copy_count='2' _source='134510656'
>> 0: one='22' two='222' three='33.300000' four='44'
>> 1: sizes '3' '4' '8' '2'
>> 1: offsets '0' '4' '8' '16'
>> 1: addresses '134510640' '134510644' '134510648' '134510656'
>> 1: name='MPI_CHAR' _copy_blength='3' orig_copy_blength='1'
>> _copy_count='3' _destination='134510640'
>> 1: name='MPI_LONG' _copy_blength='16' orig_copy_blength='4'
>> _copy_count='4' _destination='134510644'
>> 1: name='MPI_DOUBLE' _copy_blength='64' orig_copy_blength='8'
>> _copy_count='8' _destination='134510648'
>> 1: name='MPI_SHORT' _copy_blength='4' orig_copy_blength='2'
>> _copy_count='2' _destination='134510656'
>> 1: one='22' two='222' three='33.300000' four='44'
>>
>> You can see from the output that the MPI_Send & MPI_Recv functions
>> are reading or writing too much data from my structure, causing an
>> overflow condition to occur. I believe this is causing my
>> application to crash.
>>
>> Any help on this problem would be appreciated. If there is
>> anything else that you need from me, just let me know.
>>
>> Thanks,
>> Brian
>>
>>
>>
>>
>> On Tue, Apr 28, 2009 at 9:58 PM, Brian Blank <brianblank_at_[hidden]>
>> wrote:
>> To Whom This May Concern:
>>
>> I am having problems with an OpenMPI application I am writing on
>> the Solaris/Intel AMD64 platform. I am using OpenMPI 1.3.2 which I
>> compiled myself using the Solaris C/C++ compiler.
>>
>> My application was crashing (signal 11) inside a call to malloc()
>> made by my own code. I thought it might be a memory overflow
>> error that was causing this, so I fired up Purify. Purify found
>> several problems inside the OpenMPI library. I think one of
>> the errors is serious and might be causing the problems I was
>> looking for.
>>
>> The serious error is an Array Bounds Write (ABW), which occurred 824
>> times across 312 calls to MPI_Recv(). This error means that
>> something inside the OpenMPI library is writing to memory where it
>> shouldn't be. Here is the stack trace at the time of
>> this error:
>>
>> Stack Trace 1 (Occurred 596 times)
>>
>> memcpy rtlib.o
>> unpack_predefined_data [datatype_unpack.h:41]
>> MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
>> ompi_generic_simple_unpack [datatype_unpack.c:419]
>> ompi_convertor_unpack [convertor.c:314]
>> mca_pml_ob1_recv_frag_callback_match [pml_ob1_recvfrag.c:218]
>> mca_btl_sm_component_progress [btl_sm_component.c:427]
>> opal_progress [opal_progress.c:207]
>> opal_condition_wait [condition.h:99]
>> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768
>> illegal).>
>> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0
>> of 664 bytes.>
>>
>> Stack Trace 2 (Occurred 228 times)
>>
>> memcpy rtlib.o
>> unpack_predefined_data [datatype_unpack.h:41]
>> MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
>> ompi_generic_simple_unpack [datatype_unpack.c:419]
>> ompi_convertor_unpack [convertor.c:314]
>> mca_pml_ob1_recv_request_progress_match [pml_ob1_recvreq.c:624]
>> mca_pml_ob1_recv_req_start [pml_ob1_recvreq.c:1008]
>> mca_pml_ob1_recv [pml_ob1_irecv.c:103]
>> MPI_Recv [precv.c:75]
>> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768
>> illegal).>
>> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0
>> of 664 bytes.>
>>
>>
>> I'm not that familiar with the inner workings of the openmpi
>> library, but I tried to debug it anyway. I noticed that it was
>> copying a lot of extra bytes for MPI_LONG and MPI_DOUBLE types. On
>> my system, MPI_LONG should have been 4 bytes, but 16 bytes were
>> being copied. Also, MPI_DOUBLE should have been 8 bytes, but 64
>> bytes were being copied. It seems the _copy_blength variable was
>> being set too high, but I'm not sure why. The above error also
>> shows 64 bytes being written, where my debugging shows a 64-byte
>> copy for all MPI_DOUBLE types, which I feel should have been 8
>> bytes. Therefore, I really believe _copy_blength is being set too
>> high.
>>
>>
>> Interestingly enough, the call to MPI_Isend() was generating an ABR
>> (Array Bounds Read) error on the exact same line of code. An ABR
>> error can sometimes be fatal if the bytes being read are not at a
>> legal address, but the ABW error is usually much more serious
>> because it is definitely writing into memory that is probably used
>> for something else. I'm sure that if we fix the ABW error, the ABR
>> error will fix itself too, as it's the same line of code.
>>
>> Purify also found 14 UMR (uninitialized memory read) errors inside
>> the OpenMPI library. These can be really bad if any decisions are
>> made based on the values read from those memory locations. But for
>> now, let's solve the serious error I reported above first; I will
>> send you the UMR errors next.
>>
>> Any help you can provide would be greatly appreciated.
>>
>> Thanks,
>> Brian
>>
>>
>>
>> <datatype_pack.h><datatype_unpack.h><test.c>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel