
Subject: Re: [OMPI users] Segfault when using valgrind
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-07-09 07:16:22


On Jul 7, 2009, at 11:47 AM, Justin wrote:

> (Sorry if this is posted twice; I sent the same email yesterday, but it
> never appeared on the list).
>

Sorry for the delay in replying. FWIW, I got your original message as
well.

> Hi, I am attempting to debug memory corruption in an MPI program
> using valgrind. However, when I run under valgrind I get semi-random
> segfaults and valgrind messages from within the Open MPI library.
> Here is an example of such a segfault:
>
> ==6153==
> ==6153== Invalid read of size 8
> ==6153== at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
>
...
> ==6153== Address 0x10 is not stack'd, malloc'd or (recently) free'd
> Thread "main"(pid 6153) caught signal SIGSEGV at address (nil)
> (segmentation violation)
>
> The code around our Isend at SFC.h:298 does not seem to have any
> errors:
>
> =============================================
> MergeInfo<BITS> myinfo,theirinfo;
>
> MPI_Request srequest, rrequest;
> MPI_Status status;
>
> myinfo.n=n;
> if(n!=0)
> {
>   myinfo.min=sendbuf[0].bits;
>   myinfo.max=sendbuf[n-1].bits;
> }
> //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" << (int)myinfo.max << endl;
>
> MPI_Isend(&myinfo,sizeof(MergeInfo<BITS>),MPI_BYTE,to,0,Comm,&srequest);
> ==============================================
>
> myinfo is a struct located on the stack, to is the rank of the
> processor that the message is being sent to, and srequest is also on
> the stack. In addition, this message is waited on prior to exiting
> this block of code, so both still exist on the stack. When I don't
> run with valgrind, my program runs past this point just fine.
>

Strange. I can't think of an immediate reason why this would happen
-- does it also happen if you use a blocking send (vs. an Isend)? Is
myinfo a complex object, or a variable-length object?
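
Just to be concrete, here's a minimal sketch of the kind of blocking
test I mean -- note that the MergeInfo layout below is a guess (the
real one is in your SFC.h), and I've used MPI_Sendrecv so that
neither side has a request to keep alive:

=============================================
#include <mpi.h>

// Hypothetical stand-in for MergeInfo<BITS> from SFC.h; for MPI_BYTE
// transfers the struct must be plain old data (no pointers, no
// virtual functions).
struct MergeInfo {
  unsigned long min;
  unsigned long max;
  int n;
};

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MergeInfo myinfo = { 0, 0, 0 };
  MergeInfo theirinfo;
  int partner = rank ^ 1;   // pair up ranks 0<->1, 2<->3, ...

  if (partner < size) {
    MPI_Status status;
    // Combined blocking send/recv: no srequest/rrequest to keep
    // alive, so request lifetime is taken out of the picture.
    MPI_Sendrecv(&myinfo,    sizeof(MergeInfo), MPI_BYTE, partner, 0,
                 &theirinfo, sizeof(MergeInfo), MPI_BYTE, partner, 0,
                 MPI_COMM_WORLD, &status);
  }

  MPI_Finalize();
  return 0;
}
=============================================

If the blocking version runs cleanly under valgrind, that at least
narrows things down to the nonblocking path.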

> I am currently using Open MPI 1.3 from the Debian unstable branch. I
> also see the same type of segfault in a different portion of the code
> involving an MPI_Allgather, which can be seen below:
>
> ==============================================
> ==22736== Use of uninitialised value of size 8
> ==22736== at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
> ==22736== by 0x1382CE09: opal_progress (opal_progress.c:207)
> ==22736== by 0xB404264: ompi_request_default_wait_all (condition.h:99)
> ==22736== by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
> ==22736== by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
> ==22736== by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
> ==22736== by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
> ==22736== by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
> ==22736== by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
> ==22736== by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
> ==22736== by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
> ==22736== by 0x4089AE: main (sus.cc:629)
> ==22736==
> ==22736== Invalid read of size 8
> ==22736== at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
> ==22736== by 0x1382CE09: opal_progress (opal_progress.c:207)
> ==22736== by 0xB404264: ompi_request_default_wait_all (condition.h:99)
> ==22736== by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
> ==22736== by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
> ==22736== by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
> ==22736== by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
> ==22736== by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
> ==22736== by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
> ==22736== by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
> ==22736== by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
> ==22736== by 0x4089AE: main (sus.cc:629)
> ================================================================
>
> Are these problems with Open MPI, and are there any known workarounds?
>

These are new to me. The problem does seem to occur with OMPI's
shared memory device; you might want to try a different point-to-point
device (e.g., tcp) to see if the problem goes away. But be aware
that the problem "going away" does not really pinpoint the location of
the problem -- moving to a slower transport (like tcp) may simply
change timing such that the problem does not occur. I.e., the problem
could still exist in either your code or OMPI -- this would simply be
a workaround.
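
If you want to try that, you can disable the sm BTL from the mpirun
command line (these are standard MCA parameter selections; the -np
value and executable name are just placeholders):

  mpirun --mca btl ^sm -np 4 ./sus
  mpirun --mca btl tcp,self -np 4 ./sus

The first form excludes shared memory; the second forces tcp (plus
self for loopback). Also, since you're running under valgrind anyway:
Open MPI 1.3 ships a valgrind suppression file for known-harmless
reports from the library itself. It is installed as
share/openmpi/openmpi-valgrind.supp under the Open MPI prefix (the
exact path may differ in the Debian packaging), and you can pass it
via valgrind's --suppressions option to cut down the noise.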

-- 
Jeff Squyres
Cisco Systems