
Open MPI Development Mailing List Archives


Subject: [OMPI devel] [Fwd: Re: Fwd: Purify found bugs inside open-mpi library]
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-05-02 07:27:47


Forwarding the following message since it contains Brian Blank's purify
results; I think most of them are out of my purview, and I am also on
vacation now :-).

--td


attached mail follows:


Hi Terry,

I did a memset() prior to the call to processor_info(), and the UMR went
away. I further tested by setting pinfo.pi_state to -1 prior to the call to
processor_info(), and processor_info() always sets pinfo.pi_state to 2.
Therefore, I am starting to suspect this is a bug in purify. Maybe purify
is having issues detecting that this variable was updated by system code.
I'm going to forward a sample program to the IBM purify team to have them
investigate further.
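The pattern Brian describes -- zero-filling the out-parameter struct before the system call so a memory checker sees every byte as defined -- can be sketched as follows. Since processor_info() is Solaris-specific, a stand-in filler function is used here purely for illustration; the struct and function names are assumptions, not the real Solaris API.

```c
#include <string.h>

/* Minimal stand-in for the Solaris processor_info_t: only pi_state
 * matters for this illustration. */
struct proc_info { int pi_state; char pad[60]; };

/* Stand-in for processor_info(): mimics the observed behaviour of always
 * setting pi_state (2 corresponds to an online processor in Brian's test)
 * while leaving the rest of the struct untouched. */
static int fake_processor_info(struct proc_info *p)
{
    p->pi_state = 2;
    return 0;
}

int query_state(void)
{
    struct proc_info info;

    /* Zero-fill first: every byte of the struct is now defined, so a
     * checker like Purify cannot flag a UMR when any field is read. */
    memset(&info, 0, sizeof(info));
    fake_processor_info(&info);
    return info.pi_state;
}
```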

I also attached a copy of mpi_purify.txt, which contains all the purify
findings. There are a handful of UMR errors that occur through different
call stacks. Also, there are 2 file descriptors left open and a lot of
memory that leaked despite my calling MPI_Finalize().

Let me know if you need me to try something else or to produce any
additional output.

Thanks again,
Brian

On Thu, Apr 30, 2009 at 10:11 AM, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:

> So I've been kibitzing with Jeff on the below. If you do a memset of pinfo
> prior to the line you show below, does the UMR go away? I believe it will
> not, and that you will probably need to do something like pinfo.pi_state = 0.
> Can you try this out for me?
> Thanks,
>
> --td
>
>
> Brian Blank wrote:
>
>> Hi Jeff,
>>
>> That definitely worked for me. Thanks so much for your help.
>>
>> Purify did find some other UMR (uninitialized memory read) errors, though
>> they don't seem to be negatively impacting my application right now.
>> Nonetheless, I'll post them later today in case anyone is interested in
>> them.
>>
>> Just to give you a sample of what it sees now, one of the UMR errors
>> seems a little odd ... paffinity_solaris_module.c line 180.
>> if (P_ONLINE == pinfo.pi_state || P_NOINTR == pinfo.pi_state) {
>>
>> I'll post the rest of the UMR errors later today.
>>
>> Thanks again,
>> Brian
>>
>> On Apr 29, 2009, at 4:06 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>
>> Actually, I think your program is erroneous -- it looks like you're using
>>> number of bytes for the sizes[] array when it really should be using number
>>> of elements. Specifically, it should be:
>>>
>>> sizes[0] = (int) sizeof(tstruct.one);
>>> sizes[1] = 1;
>>> sizes[2] = 1;
>>> sizes[3] = 1;
>>>
>>> Since MPI knows the sizes of datatypes, you specify counts of datatypes
>>> -- not numbers of bytes.
>>>
>>> That seemed to make your program work for me; double check and ensure
>>> that it works for you.
>>>
>>>
>>> On Apr 29, 2009, at 1:21 PM, Brian Blank wrote:
>>>
>>> To Whom This May Concern:
>>>>
>>>> I originally sent this to the users list, but realizing now that this
>>>> might be more appropriate for the developer's list as it is dealing with
>>>> issues internal to the openmpi library (sorry for the dual distribution).
>>>> Please start with second email first.
>>>>
>>>> Thanks,
>>>> Brian Blank
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Brian Blank <brianblank_at_[hidden]>
>>>> Date: Wed, Apr 29, 2009 at 1:09 PM
>>>> Subject: Re: Purify found bugs inside open-mpi library
>>>> To: users_at_[hidden]
>>>>
>>>>
>>>> To Whom This May Concern:
>>>>
>>>> I've been trying to dig a little deeper into this problem and found some
>>>> additional information.
>>>>
>>>> First, the stack trace for the ABR and ABW were different. The ABR
>>>> problem occurred in datatype_pack.h while the ABW problem occurred in
>>>> datatype_unpack.h. The problem appears to be the same still. Both errors
>>>> are occurring during a call to MEMCPY_CSUM().
>>>>
>>>> I also found there are two different variables playing into this bug.
>>>> There is _copy_blength and _copy_count. At the top of the function, both
>>>> of these variables are set to 2 bytes for MPI_SHORT, 4 bytes for MPI_LONG,
>>>> and 8 bytes for MPI_DOUBLE. Then, these variables are multiplied together
>>>> to get the size of the memcpy(). Unfortunately, the correct size is the
>>>> value of either variable before they are squared. There also appears to be
>>>> an optimization where, if two fields are adjacent, they are sometimes
>>>> converted into an MPI_BYTE whose size incorrectly incorporates these
>>>> squared values as well.
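A toy model of the suspected behaviour (the real datatype engine's variable handling may differ; the function names here are illustrative only): if both _copy_blength and _copy_count start out as the element size in bytes, multiplying them yields the size squared rather than size times element count, which matches the printed values (16 for a 4-byte MPI_LONG, 64 for an 8-byte MPI_DOUBLE, 4 for a 2-byte MPI_SHORT):

```c
#include <stddef.h>

/* Toy model of the bug: both variables are initialized to the element
 * size in bytes, so their product is the size squared. */
size_t buggy_memcpy_length(size_t elem_size)
{
    size_t copy_blength = elem_size;  /* bytes per element             */
    size_t copy_count   = elem_size;  /* bug: should be element count  */
    return copy_blength * copy_count; /* e.g. 64 for a double, not 8   */
}

/* Correct length: element size times the number of elements. */
size_t fixed_memcpy_length(size_t elem_size, size_t count)
{
    return elem_size * count;         /* 8 for a single double         */
}
```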
>>>>
>>>> I wrote a small test program to illustrate the problem and attached it
>>>> to this email. First, I configured openmpi 1.3.2 with the following
>>>> options:
>>>>
>>>> ./configure --prefix=/myworkingdirectory/openmpi-1.3.2.local
>>>> --disable-mpi-f77 --disable-mpi-f90 --enable-debug --enable-mem-debug
>>>> --enable-mem-profile
>>>>
>>>> I then modified datatype_pack.h & datatype_unpack.h located in
>>>> openmpi-1.3.2/ompi/datatype directory in order to produce additional
>>>> debugging output. The new versions are attached to this email.
>>>>
>>>> Then, I executed "make all install"
>>>>
>>>> Then, I wrote the attached test.c program. You can find the output
>>>> below. The problems appear in red.
>>>>
>>>> 0: sizes '3' '4' '8' '2'
>>>> 0: offsets '0' '4' '8' '16'
>>>> 0: addresses '134510640' '134510644' '134510648' '134510656'
>>>> 0: name='MPI_CHAR' _copy_blength='3' orig_copy_blength='1'
>>>> _copy_count='3' _source='134510640'
>>>> 0: name='MPI_LONG' _copy_blength='16' orig_copy_blength='4'
>>>> _copy_count='4' _source='134510644'
>>>> 0: name='MPI_DOUBLE' _copy_blength='64' orig_copy_blength='8'
>>>> _copy_count='8' _source='134510648'
>>>> 0: name='MPI_SHORT' _copy_blength='4' orig_copy_blength='2'
>>>> _copy_count='2' _source='134510656'
>>>> 0: one='22' two='222' three='33.300000' four='44'
>>>> 1: sizes '3' '4' '8' '2'
>>>> 1: offsets '0' '4' '8' '16'
>>>> 1: addresses '134510640' '134510644' '134510648' '134510656'
>>>> 1: name='MPI_CHAR' _copy_blength='3' orig_copy_blength='1'
>>>> _copy_count='3' _destination='134510640'
>>>> 1: name='MPI_LONG' _copy_blength='16' orig_copy_blength='4'
>>>> _copy_count='4' _destination='134510644'
>>>> 1: name='MPI_DOUBLE' _copy_blength='64' orig_copy_blength='8'
>>>> _copy_count='8' _destination='134510648'
>>>> 1: name='MPI_SHORT' _copy_blength='4' orig_copy_blength='2'
>>>> _copy_count='2' _destination='134510656'
>>>> 1: one='22' two='222' three='33.300000' four='44'
>>>>
>>>> You can see from the output that the MPI_Send & MPI_Recv functions are
>>>> reading or writing too much data from my structure, causing an overflow
>>>> condition to occur. I believe this is causing my application to crash.
>>>>
>>>> Any help on this problem would be appreciated. If there is anything
>>>> else that you need from me, just let me know.
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Apr 28, 2009 at 9:58 PM, Brian Blank <brianblank_at_[hidden]>
>>>> wrote:
>>>> To Whom This May Concern:
>>>>
>>>> I am having problems with an OpenMPI application I am writing on the
>>>> Solaris/Intel AMD64 platform. I am using OpenMPI 1.3.2 which I compiled
>>>> myself using the Solaris C/C++ compiler.
>>>>
>>>> My application was crashing (signal 11) inside a call to malloc() that
>>>> my code was making. I thought it might be a memory overflow error that was
>>>> causing this, so I fired up Purify. Purify found several problems inside
>>>> the OpenMPI library. I think one of the errors is serious and might be
>>>> causing the problems I was looking for.
>>>>
>>>> The serious error is an Array Bounds Write (ABW) which occurred 824
>>>> times through 312 calls to MPI_Recv(). This error means that something
>>>> inside the OpenMPI library is writing to memory where it shouldn't be
>>>> writing to. Here is the stack trace at the time of this error:
>>>>
>>>> Stack Trace 1 (Occurred 596 times)
>>>>
>>>> memcpy rtlib.o
>>>> unpack_predefined_data [datatype_unpack.h:41]
>>>> MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
>>>> ompi_generic_simple_unpack [datatype_unpack.c:419]
>>>> ompi_convertor_unpack [convertor.c:314]
>>>> mca_pml_ob1_recv_frag_callback_match [pml_ob1_recvfrag.c:218]
>>>> mca_btl_sm_component_progress [btl_sm_component.c:427]
>>>> opal_progress [opal_progress.c:207]
>>>> opal_condition_wait [condition.h:99]
>>>> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
>>>> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of
>>>> 664 bytes.>
>>>>
>>>> Stack Trace 2 (Occurred 228 times)
>>>>
>>>> memcpy rtlib.o
>>>> unpack_predefined_data [datatype_unpack.h:41]
>>>> MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
>>>> ompi_generic_simple_unpack [datatype_unpack.c:419]
>>>> ompi_convertor_unpack [convertor.c:314]
>>>> mca_pml_ob1_recv_request_progress_match [pml_ob1_recvreq.c:624]
>>>> mca_pml_ob1_recv_req_start [pml_ob1_recvreq.c:1008]
>>>> mca_pml_ob1_recv [pml_ob1_irecv.c:103]
>>>> MPI_Recv [precv.c:75]
>>>> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
>>>> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of
>>>> 664 bytes.>
>>>>
>>>>
>>>> I'm not that familiar with the inner workings of the openmpi library,
>>>> but I tried to debug it anyway. I noticed that it was copying a lot of
>>>> extra bytes for MPI_LONG and MPI_DOUBLE types. On my system, MPI_LONG
>>>> should have been 4 bytes, but was copying 16 bytes. Also, MPI_DOUBLE should
>>>> have been 8 bytes, but was copying 64 bytes. It seems the _copy_blength
>>>> variable was being set too high, but I'm not sure why. The above error also
>>>> shows 64 bytes being read, where my debugging shows a 64 byte copy for all
>>>> MPI_DOUBLE types, which I feel should have been 8 bytes. Therefore, I
>>>> really believe _copy_blength is being set too high.
>>>>
>>>>
>>>> Interestingly enough, the call to MPI_Isend() was generating an ABR
>>>> (Array Bounds Read) error in the exact same line of code. The ABR error
>>>> can sometimes be fatal if the bytes being read are not at a legal address, but
>>>> the ABW error is usually a much more fatal error because it is definitely
>>>> writing into memory that is probably used for something else. I'm sure that
>>>> if we fix the ABW error, the ABR error should fix itself too as it's the
>>>> same line of code.
>>>>
>>>> Purify also found 14 UMR (Uninitialized memory read) errors inside the
>>>> OpenMPI library. Sometimes this can be really bad if there are any
>>>> decisions being made as a result of reading this memory location. But for
>>>> now, let's solve the serious error I reported above first, then I will send
>>>> you the UMR errors next.
>>>>
>>>> Any help you can provide would be greatly appreciated.
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>>
>>>>
>>>> [attachments: datatype_pack.h, datatype_unpack.h, test.c]
>>>>
>>>> _______________________________________________
>>>>
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>>
>>
>
>