Open MPI Development Mailing List Archives

From: Tim Prins (tprins_at_[hidden])
Date: 2007-09-21 23:10:42


But I am compiling Open MPI with --without-memory-manager, so shouldn't valgrind work in that case?

Anyway, I ran the tests, and valgrind reported two different (and potentially
related) problems:

1.
==12680== Invalid read of size 4
==12680== at 0x709DE03: ompi_cb_fifo_write_to_head (ompi_circular_buffer_fifo.h:271)
==12680== by 0x709DA77: ompi_fifo_write_to_head (ompi_fifo.h:324)
==12680== by 0x709D964: mca_btl_sm_component_progress (btl_sm_component.c:398)
==12680== by 0x705BF6B: mca_bml_r2_progress (bml_r2.c:110)
==12680== by 0x44F905B: opal_progress (opal_progress.c:187)
==12680== by 0x704F0E5: opal_condition_wait (condition.h:98)
==12680== by 0x704EFD4: mca_pml_ob1_recv (pml_ob1_irecv.c:124)
==12680== by 0x7202A62: ompi_coll_tuned_scatter_intra_binomial (coll_tuned_scatter.c:166)
==12680== by 0x71F2C08: ompi_coll_tuned_scatter_intra_dec_fixed (coll_tuned_decision_fixed.c:746)
==12680== by 0x4442494: PMPI_Scatter (pscatter.c:125)
==12680== by 0x8048F6F: main (scatter_in_place.c:73)

2.
==28775== Jump to the invalid address stated on the next line
==28775== at 0x2F305F35: ???
==28775== by 0x704AF6B: mca_bml_r2_progress (bml_r2.c:110)
==28775== by 0x44F905B: opal_progress (opal_progress.c:187)
==28775== by 0x440BF6B: opal_condition_wait (condition.h:98)
==28775== by 0x440BDF7: ompi_request_wait (req_wait.c:46)
==28775== by 0x71EF396: ompi_coll_tuned_reduce_scatter_intra_basic_recursivehalving (coll_tuned_reduce_scatter.c:319)
==28775== by 0x71E1540: ompi_coll_tuned_reduce_scatter_intra_dec_fixed (coll_tuned_decision_fixed.c:471)
==28775== by 0x7202806: ompi_osc_pt2pt_module_fence (osc_pt2pt_sync.c:84)
==28775== by 0x44501B5: PMPI_Win_fence (pwin_fence.c:57)
==28775== by 0x80493D6: test_acc3_1 (test_acc3.c:156)
==28775== by 0x8048FD0: test_acc3 (test_acc3.c:26)
==28775== by 0x8049609: main (test_acc3.c:206)
==28775== Address 0x2F305F35 is not stack'd, malloc'd or (recently) free'd
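
For reference, an "Invalid read" here just means a load from an address that
is not inside any live stack or heap allocation (for example, past the end of
a malloc'd buffer), which would also be consistent with the glibc corruption
messages from the earlier runs. A trivial standalone example of that class of
bug, nothing to do with the sm BTL code in the traces above, just to
illustrate what valgrind is complaining about:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int *buf = malloc(4 * sizeof(int));   /* room for exactly 4 ints */
    if (buf == NULL)
        return 1;

    buf[4] = 42;        /* one element past the end: "Invalid write of size 4" */
    int x = buf[4];     /* and the matching "Invalid read of size 4" */
    printf("%d\n", x);

    free(buf);          /* glibc's malloc may abort here with a corruption message */
    return 0;
}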

I don't know what to make of these. Here is the link to the full results:
http://www.open-mpi.org/mtt/index.php?do_redir=386

Thanks,

Tim

On Friday 21 September 2007 10:40:21 am George Bosilca wrote:
> Tim,
>
> Valgrind will not help ... It can help with double frees and things
> like that, but not with over-running memory that belongs to your
> application. However, in Open MPI we have something that might help
> you: the option --enable-mem-debug adds unused space at the end of
> each memory allocation and makes sure we don't write anything there. I
> think this is the simplest way to pinpoint this problem.
>
> Thanks,
> george.
>
> On Sep 21, 2007, at 10:07 AM, Tim Prins wrote:
> > Aurelien and Brian.
> >
> > Thanks for the suggestions. I reran the runs with
> > --without-memory-manager and got (on 2 of 5000 runs):
> > *** glibc detected *** corrupted double-linked list: 0xf704dff8 ***
> > on one and
> > *** glibc detected *** malloc(): memory corruption: 0xeda00c70 ***
> > on the other.
> >
> > So it looks like somewhere we are over-running our allocated space.
> > So now I
> > am attempting to redo the run with valgrind.
> >
> > Tim
> >
> > On Thursday 20 September 2007 09:59:14 pm Brian Barrett wrote:
> >> On Sep 20, 2007, at 7:02 AM, Tim Prins wrote:
> >>> In our nightly runs with the trunk I have started seeing cases
> >>> where we
> >>> appear to be segfaulting within/below malloc. Below is a typical
> >>> output.
> >>>
> >>> Note that this appears to happen only on the trunk, when we use
> >>> openib, and in 32-bit mode. It seems to happen randomly at a very low
> >>> frequency (59 out of about 60,000 32-bit openib runs).
> >>>
> >>> This could be a problem with our machine, and it has shown up since I
> >>> started testing 32-bit OFED 10 days ago.
> >>>
> >>> Anyways, just curious if anyone had any ideas.
> >>
> >> As someone else said, this usually points to a duplicate free or the
> >> like in malloc. You might want to try compiling with
> >> --without-memory-manager, as the ptmalloc2 in glibc is frequently more
> >> verbose about where errors occurred than the one in Open MPI.
> >>
> >> Brian
> >> _______________________________________________
> >> devel mailing list
> >> devel_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
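
The --enable-mem-debug option George describes above is an instance of the
general guard-byte (red-zone) technique: pad each allocation with a few extra
bytes filled with a known pattern, then check later that the pattern is still
intact. A rough, generic sketch of that idea follows; it is not Open MPI's
actual implementation, and guarded_malloc()/guarded_check() are made-up names
used only for illustration.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define GUARD_SIZE 16          /* bytes of padding after each allocation */
#define GUARD_BYTE 0xA5        /* known fill pattern for the padding */

/* Allocate size bytes plus a trailing guard region filled with GUARD_BYTE. */
static void *guarded_malloc(size_t size)
{
    unsigned char *p = malloc(size + GUARD_SIZE);
    if (p != NULL)
        memset(p + size, GUARD_BYTE, GUARD_SIZE);
    return p;
}

/* Return 0 if anything wrote past the end of the caller's size bytes. */
static int guarded_check(const void *ptr, size_t size)
{
    const unsigned char *guard = (const unsigned char *) ptr + size;
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (guard[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}

int main(void)
{
    char *buf = guarded_malloc(8);
    if (buf == NULL)
        return 1;
    strcpy(buf, "12345678");   /* 8 chars + '\0' = 9 bytes: overruns by one */
    if (!guarded_check(buf, 8))
        fprintf(stderr, "guard bytes clobbered: buffer overrun detected\n");
    free(buf);
    return 0;
}

A real debugging allocator would typically also guard the front of each block
and record file/line information, but the trailing-pattern check is the core
of how an overrun like the one suspected in this thread gets flagged.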