Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Possible buffer overrun bug in opal_free_list_grow, called by MPI::Init
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-10-17 09:34:59


Stephan,

You might be right. intptr_t is a signed type, which allows the result
of % to be negative. On the other hand, mod is defined as a size_t,
which [based on my memory] is definitely unsigned, as it represents a
size.
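
For illustration, a minimal sketch of that signedness difference, using
the pointer value from Stephan's trace below (int32_t stands in for a
32-bit intptr_t, so the snippet behaves the same on any host):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* the pointer opal_free_list_grow aligns in Stephan's trace:
           the allocation 0xb2d29f48 plus the 16-byte opal_list_item_t */
        uint32_t addr = 0xb2d29f58u;

        /* signed view: the value is negative, and C99's % truncates
           toward zero, so the remainder is negative too */
        int32_t smod = (int32_t)addr % 128;   /* -40 */

        /* unsigned view: the remainder is just the low bits */
        uint32_t umod = addr % 128;           /* 88 */

        printf("signed: %d, unsigned: %u\n", (int)smod, (unsigned)umod);
        return 0;
    }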

Did you try applying your patch to Open MPI? If so, does it resolve
the issue?

   george.

On Oct 16, 2008, at 7:29 PM, Stephan Kramer wrote:

> George Bosilca wrote:
>> I did investigate this issue for about 3 hours yesterday. Neither
>> valgrind nor efence reports any errors on my cluster. I'm using
>> Debian unstable with gcc-4.1.2. Adding printfs doesn't show the
>> same output as yours; all addresses are in the correct range. I went
>> over the code manually and, to be honest, I cannot imagine how this
>> could happen IF the compiler is doing what it is supposed to do.
>>
>> I've run out of options on this one. If you can debug it and figure
>> out what the problem is, I'll be happy to hear it.
>>
>> george.
> Hi George,
>
> Thanks a lot for looking into this. I think I've come a bit further
> with it. The reproducibility may in fact have to do with
> 32-bit/64-bit differences. I think the culprit is line 105 of
> opal_free_list.c:
>
> mod = (intptr_t)ptr % CACHE_LINE_SIZE;
> if(mod != 0) {
>     ptr += (CACHE_LINE_SIZE - mod);
> }
>
> Because intptr_t is a signed type, on 32-bit systems with addresses
> above 0x7fffffff the outcome of mod will be negative. Thus ptr will
> be increased by more than CACHE_LINE_SIZE, which is not accounted
> for in the buffer size allocated at line 93, and a buffer overrun
> occurs in the subsequent element loop. This is confirmed by the
> output of the debugging statements I've pasted below. Also, I
> haven't come across the same bug on 64-bit machines.
>
> I guess this should be uintptr_t instead?
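>
> For concreteness, a sketch of that one-cast fix (mod stays a size_t;
> only the cast changes):
>
> mod = (uintptr_t)ptr % CACHE_LINE_SIZE;
> if(mod != 0) {
>     ptr += (CACHE_LINE_SIZE - mod);
> }
>
> With the unsigned cast, mod always lies in [0, CACHE_LINE_SIZE), so
> ptr advances by at most CACHE_LINE_SIZE - 1 bytes, which fits the
> CACHE_LINE_SIZE of slack included in the allocation at line 93.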
>
> Cheers
> Stephan Kramer
>
> The debugging output:
>
> mpidebug: num_elements = 1, flist->fl_elem_size = 40
> mpidebug: sizeof(opal_list_item_t) = 16
> mpidebug: allocating 184
> mpidebug: allocated at memory address 0xb2d29f48
> mpidebug: mod = -40, CACHE_LINE_SIZE = 128
>
> and at the point of the buffer overrun/efence segfault in gdb:
>
> (gdb) print item
> $23 = (opal_free_list_item_t *) 0xb2d2a000
>
> which is exactly at (one past) the end of the buffer:
> 0xb2d2a000 = 0xb2d29f48 + 184
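>
> (As a cross-check of those numbers: the pointer being aligned is
> 0xb2d29f48 + 16 = 0xb2d29f58, which as a signed 32-bit intptr_t is
> -1294819496; C's truncating % gives -1294819496 % 128 = -40, matching
> the mod above, so ptr is advanced by 128 - (-40) = 168 bytes and lands
> at 0xb2d29f58 + 168 = 0xb2d2a000, the first byte past the 184-byte
> buffer.)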
>
>>
>> On Oct 14, 2008, at 11:03 AM, Stephan Kramer wrote:
>>
>>> Would someone mind having another look at the bug reported below?
>>> I'm hitting exactly the same problem with Debian unstable,
>>> openmpi 1.2.7~rc2. Both valgrind and efence are indispensable
>>> tools for any developer, and each may catch errors the other
>>> won't report. Electric Fence is especially good at catching buffer
>>> overruns, as it protects the beginning and end of each allocated
>>> buffer. The original bug report shows an undeniable buffer overrun
>>> in MPI::Init: the attached patch prints exactly the address it's
>>> trying to access, which is past the end of the buffer. Any help
>>> would be much appreciated.
>>>
>>> Stephan Kramer
>>>
>>>
>>>> Patrick,
>>>>
>>>> I'm unable to reproduce the buffer overrun with the latest trunk.
>>>> I run valgrind (with the memchecker tool) on a regular basis on
>>>> the trunk, and I have never noticed anything like that. Moreover,
>>>> I went over the code, and I cannot imagine how we could overrun
>>>> the buffer in the code you pinpointed.
>>>>
>>>> Thanks,
>>>> george.
>>>>
>>>> On Aug 23, 2008, at 7:57 PM, Patrick Farrell wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > I think I have found a buffer overrun in a function
>>>> > called by MPI::Init, though explanations of why I am
>>>> > wrong are welcome.
>>>> >
>>>> > I am using the openmpi included in Ubuntu Hardy,
>>>> > version 1.2.5, though I have inspected the latest trunk by eye
>>>> > and I don't believe the relevant code has changed.
>>>> >
>>>> > I was trying to use Electric Fence, a memory debugging library,
>>>> > to debug a suspected buffer overrun in my own program.
>>>> > Electric Fence works by replacing malloc/free in such
>>>> > a way that bounds violation errors issue a segfault.
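>>>> >
>>>> > As background, that guard-page technique can be sketched in a few
>>>> > lines; this guarded_malloc helper is a hypothetical illustration,
>>>> > not efence's actual code:
>>>> >
>>>> > #define _DEFAULT_SOURCE
>>>> > #include <string.h>
>>>> > #include <sys/mman.h>
>>>> > #include <unistd.h>
>>>> >
>>>> > /* Round the request up to whole pages and right-align the user
>>>> >  * buffer against an inaccessible guard page, so touching even one
>>>> >  * byte past the end faults immediately. */
>>>> > static void *guarded_malloc(size_t size)
>>>> > {
>>>> >     size_t page = (size_t)sysconf(_SC_PAGESIZE);
>>>> >     size_t span = ((size + page - 1) / page) * page;
>>>> >     unsigned char *base = mmap(NULL, span + page,
>>>> >                                PROT_READ | PROT_WRITE,
>>>> >                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>> >     if (base == MAP_FAILED)
>>>> >         return NULL;
>>>> >     mprotect(base + span, page, PROT_NONE);  /* trailing guard page */
>>>> >     return base + span - size;
>>>> > }
>>>> >
>>>> > int main(void)
>>>> > {
>>>> >     char *p = guarded_malloc(184);
>>>> >     memset(p, 0, 184);  /* in bounds: fine */
>>>> >     p[184] = 0;         /* one byte past the end: SIGSEGV */
>>>> >     return 0;
>>>> > }
>>>> >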
>>>> > While running my program under Electric Fence, I got a segfault
>>>> > at:
>>>> >
>>>> > 0xb5cdd334 in opal_free_list_grow (flist=0xb2b46a50, num_elements=1) at class/opal_free_list.c:113
>>>> > 113         OBJ_CONSTRUCT_INTERNAL(item, flist->fl_elem_class);
>>>> > (gdb) bt
>>>> > #0  0xb5cdd334 in opal_free_list_grow (flist=0xb2b46a50, num_elements=1) at class/opal_free_list.c:113
>>>> > #1  0xb5cdd479 in opal_free_list_init (flist=0xb2b46a50, elem_size=56, elem_class=0xb2b46e20, num_elements_to_alloc=73, max_elements_to_alloc=-1, num_elements_per_alloc=1) at class/opal_free_list.c:78
>>>> > #2  0xb2b381aa in ompi_osc_pt2pt_component_init (enable_progress_threads=false, enable_mpi_threads=false) at osc_pt2pt_component.c:173
>>>> > #3  0xb792b67c in ompi_osc_base_find_available (enable_progress_threads=false, enable_mpi_threads=false) at base/osc_base_open.c:84
>>>> > #4  0xb78e6abe in ompi_mpi_init (argc=5, argv=0xbfd61f84, requested=0, provided=0xbfd61e78) at runtime/ompi_mpi_init.c:411
>>>> > #5  0xb7911a87 in PMPI_Init (argc=0xbfd61f00, argv=0xbfd61f04) at pinit.c:71
>>>> > #6  0x0811ca6c in MPI::Init ()
>>>> > #7  0x08118b8a in main ()
>>>> >
>>>> > To investigate further, I replaced the OBJ_CONSTRUCT_INTERNAL
>>>> > macro with its definition from opal/class/opal_object.h and ran
>>>> > it again. It appears that the invalid memory access happens on
>>>> > the instruction
>>>> >
>>>> >     ((opal_object_t *) (item))->obj_class = (flist->fl_elem_class);
>>>> >
>>>> > Investigating further, I modified the source of opal_free_list
>>>> > with the attached patch. It adds a few debugging printfs to
>>>> > diagnose exactly what the code is doing. The output of the
>>>> > debugging statements is:
>>>> >
>>>> > mpidebug: allocating 216
>>>> > mpidebug: allocated at memory address 0xb62bdf28
>>>> > mpidebug: accessing address 0xb62be000
>>>> > [segfault]
>>>> >
>>>> > Now, 0xb62be000 - 0xb62bdf28 = 216, which is
>>>> > the size of the buffer allocated, and so I think
>>>> > this is a buffer overrun.
>>>> >
>>>> > Steps to reproduce:
>>>> >
>>>> > a) Install Electric Fence
>>>> > b) Compile the following program
>>>> >
>>>> > #include <stdlib.h>
>>>> > #include <unistd.h>
>>>> >
>>>> > #include <mpi.h>
>>>> >
>>>> > int main(int argc, char **argv)
>>>> > {
>>>> >     MPI::Init(argc, argv);
>>>> >     MPI::Finalize();
>>>> >
>>>> >     return 0;
>>>> > }
>>>> >
>>>> > with
>>>> >
>>>> > mpiCC -o test ./test.cpp
>>>> >
>>>> > c) gdb ./test
>>>> > d) set environment LD_PRELOAD /usr/lib/libefence.so.0.0
>>>> > e) run
>>>> >
>>>> > Hope this helps,
>>>> >
>>>> > Patrick Farrell
>>>> >
>>>> > --
>>>> > Patrick Farrell
>>>> > PhD student
>>>> > Imperial College London
>>>> > --- openmpi-1.2.5/opal/class/opal_free_list.c  2008-08-23 18:35:03.000000000 +0100
>>>> > +++ openmpi-1.2.5-modified/opal/class/opal_free_list.c  2008-08-23 18:31:47.000000000 +0100
>>>> > @@ -90,9 +90,12 @@
>>>> >      if (flist->fl_max_to_alloc > 0 && flist->fl_num_allocated +
>>>> >          num_elements > flist->fl_max_to_alloc)
>>>> >          return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
>>>> >
>>>> > +    fprintf(stderr, "mpidebug: allocating %d\n", (num_elements * flist->fl_elem_size) + sizeof(opal_list_item_t) + CACHE_LINE_SIZE);
>>>> >      alloc_ptr = (unsigned char *)malloc((num_elements * flist->fl_elem_size) +
>>>> >                                          sizeof(opal_list_item_t) +
>>>> >                                          CACHE_LINE_SIZE);
>>>> > +    fprintf(stderr, "mpidebug: allocated at memory address %p\n", alloc_ptr);
>>>> > +
>>>> >      if(NULL == alloc_ptr)
>>>> >          return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
>>>> >
>>>> > @@ -110,7 +113,16 @@
>>>> >      for(i=0; i<num_elements; i++) {
>>>> >          opal_free_list_item_t* item = (opal_free_list_item_t*)ptr;
>>>> >          if (NULL != flist->fl_elem_class) {
>>>> > -            OBJ_CONSTRUCT_INTERNAL(item, flist->fl_elem_class);
>>>> > +            do {
>>>> > +                if (0 == (flist->fl_elem_class)->cls_initialized) {
>>>> > +                    opal_class_initialize((flist->fl_elem_class));
>>>> > +                }
>>>> > +                fprintf(stderr, "mpidebug: accessing address %p\n", &((opal_object_t *) (item))->obj_class);
>>>> > +                ((opal_object_t *) (item))->obj_class = (flist->fl_elem_class);
>>>> > +                fprintf(stderr, "mpidebug: accessing address %p\n", &((opal_object_t *) (item))->obj_reference_count);
>>>> > +                ((opal_object_t *) (item))->obj_reference_count = 1;
>>>> > +                opal_obj_run_constructors((opal_object_t *) (item));
>>>> > +            } while (0);
>>>> >          }
>>>> >          opal_list_append(&(flist->super), &(item->super));
>>>> >          ptr += flist->fl_elem_size;
>>>> > @@ -119,5 +131,3 @@
>>>> >      return OPAL_SUCCESS;
>>>> >  }
>>>> >
>>>> > -
>>>> > -


