Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] Possible buffer overrun bug in opal_free_list_grow, called by MPI::Init
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-10-15 14:12:25


I did investigate this issue for about 3 hours yesterday. Neither
valgrind nor efence report any errors on my cluster. I'm using debian
unstable with gcc-4.1.2. Adding printfs doesn't shows the same output
as you, all addresses are in the correct range. I went over the code
manually, and to be honest I cannot imagine how this might happens IF
the compiler is doing what it is supposed to do.

I run out of options on this one. If you can debug it and figure out
what's the problem there I'll be happy to hear.

  george.

On Oct 14, 2008, at 11:03 AM, Stephan Kramer wrote:

> Would someone mind having another look at the bug reported below?
> I'm hitting exactly the same problem with Debian Unstable, openmpi
> 1.2.7~rc2. Both valgrind and efence are indispensable tools for any
> developper, where both may catch errors the other won't report.
> Electric fence is especially good at catching buffer overruns as it
> protects the beginning and end of each allocated buffer. The
> original bug report shows an undeniable buffer overrun in MPI::Init,
> i.e. the attached patch prints out exactly the address it's trying
> to access which is over the end of the buffer. Any help would be
> much appreciated
>
> Stephan Kramer
>
>
>> Patrick,
>>
>> I'm unable to reproduce the buffer overrun with the latest trunk. I
>> run valgrind (with the memchekcer tool) on a regular basis on the
>> trunk, and I never noticed anything like that. Moreover, I went over
>> the code, and I cannot imagine how we can overrun the buffer in the
>> code you pinpointed.
>>
>> Thanks,
>> george.
>>
>> On Aug 23, 2008, at 7:57 PM, Patrick Farrell wrote:
>>
>> > Hi,
>> >
>> > I think I have found a buffer overrun in a function
>> > called by MPI::Init, though explanations of why I am
>> > wrong are welcome.
>> >
>> > I am using the openmpi included in Ubuntu Hardy,
>> > version 1.2.5, though I have inspected the latest trunk by eye
>> > and I don't believe the relevant code has changed.
>> >
>> > I was trying to use Electric Fence, a memory debugging library,
>> > to debug a suspected buffer overrun in my own program.
>> > Electric Fence works by replacing malloc/free in such
>> > a way that bounds violation errors issue a segfault.
>> > While running my program under Electric Fence, I found
>> > that I got a segfault issued at:
>> >
>> > 0xb5cdd334 in opal_free_list_grow (flist=0xb2b46a50,
>> num_elements=1)
>> > at class/opal_free_list.c:113
>> > 113 OBJ_CONSTRUCT_INTERNAL(item, flist->fl_elem_class);
>> > (gdb) bt
>> > #0 0xb5cdd334 in opal_free_list_grow (flist=0xb2b46a50,
>> > num_elements=1) at class/opal_free_list.c:113
>> > #1 0xb5cdd479 in opal_free_list_init (flist=0xb2b46a50,
>> > elem_size=56, elem_class=0xb2b46e20, num_elements_to_alloc=73,
>> > max_elements_to_alloc=-1, num_elements_per_alloc=1) at class/
>> > opal_free_list.c:78
>> > #2 0xb2b381aa in ompi_osc_pt2pt_component_init
>> > (enable_progress_threads=false, enable_mpi_threads=false) at
>> > osc_pt2pt_component.c:173
>> > #3 0xb792b67c in ompi_osc_base_find_available
>> > (enable_progress_threads=false, enable_mpi_threads=false) at base/
>> > osc_base_open.c:84
>> > #4 0xb78e6abe in ompi_mpi_init (argc=5, argv=0xbfd61f84,
>> > requested=0, provided=0xbfd61e78) at runtime/ompi_mpi_init.c:411
>> > #5 0xb7911a87 in PMPI_Init (argc=0xbfd61f00, argv=0xbfd61f04) at
>> > pinit.c:71
>> > #6 0x0811ca6c in MPI::Init ()
>> > #7 0x08118b8a in main ()
>> >
>> > To investigate further, I replaced the OBJ_CONSTRUCT_INTERNAL
>> > macro with its definition in opal/class/opal_object.h, and ran it
>> > again.
>> > It appears that the invalid memory access is happening
>> > on the instruction
>> >
>> > ((opal_object_t *) (item))->obj_class = (flist->fl_elem_class);
>> >
>> > Investigating further, I modified the source to opal_free_list
>> > with the attached patch. It adds a few debugging printfs to
>> > diagnose exactly what the code is doing. The output of the
>> debugging
>> > statements are:
>> >
>> > mpidebug: allocating 216
>> > mpidebug: allocated at memory address 0xb62bdf28
>> > mpidebug: accessing address 0xb62be000
>> > [segfault]
>> >
>> > Now, 0xb62be000 - 0xb62bdf28 = 216, which is
>> > the size of the buffer allocated, and so I think
>> > this is a buffer overrun.
>> >
>> > Steps to reproduce:
>> >
>> > a) Install Electric Fence
>> > b) Compile the following program
>> >
>> > #include <stdlib.h>
>> > #include <unistd.h>
>> >
>> > #include <mpi.h>
>> >
>> > int main(int argc, char **argv)
>> > {
>> > MPI::Init(argc, argv);
>> > MPI::Finalize();
>> >
>> > return 0;
>> > }
>> >
>> > with
>> >
>> > mpiCC -o test ./test.cpp
>> >
>> > c) gdb ./test
>> > d) set environment LD_PRELOAD /usr/lib/libefence.so.0.0
>> > e) run
>> >
>> > Hope this helps,
>> >
>> > Patrick Farrell
>> >
>> > --
>> > Patrick Farrell
>> > PhD student
>> > Imperial College London
>> > --- openmpi-1.2.5/opal/class/opal_free_list.c 2008-08-23
>> > 18:35:03.000000000 +0100
>> > +++ openmpi-1.2.5-modified/opal/class/opal_free_list.c 2008-08-23
>> > 18:31:47.000000000 +0100
>> > @@ -90,9 +90,12 @@
>> > if (flist->fl_max_to_alloc > 0 && flist->fl_num_allocated +
>> > num_elements > flist->fl_max_to_alloc)
>> > return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
>> >
>> > + fprintf(stderr, "mpidebug: allocating %d\n", (num_elements *
>> > flist->fl_elem_size) + sizeof(opal_list_item_t) + CACHE_LINE_SIZE);
>> > alloc_ptr = (unsigned char *)malloc((num_elements * flist-
>> > >fl_elem_size) +
>> > sizeof(opal_list_item_t) +
>> > CACHE_LINE_SIZE);
>> > + fprintf(stderr, "mpidebug: allocated at memory address %p\n",
>> > alloc_ptr);
>> > +
>> > if(NULL == alloc_ptr)
>> > return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
>> >
>> > @@ -110,7 +113,16 @@
>> > for(i=0; i<num_elements; i++) {
>> > opal_free_list_item_t* item = (opal_free_list_item_t*)ptr;
>> > if (NULL != flist->fl_elem_class) {
>> > - OBJ_CONSTRUCT_INTERNAL(item, flist->fl_elem_class);
>> > + do {
>> > + if (0 == (flist->fl_elem_class)->cls_initialized) {
>> > + opal_class_initialize((flist->fl_elem_class));
>> > + }
>> > + fprintf(stderr, "mpidebug: accessing address %p\n",
>> > &((opal_object_t *) (item))->obj_class);
>> > + ((opal_object_t *) (item))->obj_class = (flist-
>> > >fl_elem_class);
>> > + fprintf(stderr, "mpidebug: accessing address %p\n",
>> > &((opal_object_t *) (item))->obj_reference_count);
>> > + ((opal_object_t *) (item))->obj_reference_count = 1;
>> > + opal_obj_run_constructors((opal_object_t *) (item));
>> > + } while (0);
>> > }
>> > opal_list_append(&(flist->super), &(item->super));
>> > ptr += flist->fl_elem_size;
>> > @@ -119,5 +131,3 @@
>> > return OPAL_SUCCESS;
>> > }
>> >
>> > -
>> > -
>> > _______________________________________________
>> > devel mailing list
>> > devel_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel