Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segfault when using valgrind
From: Justin Luitjens (luitjens_at_[hidden])
Date: 2009-07-09 11:08:42


I was able to get rid of the segfaults/invalid reads by disabling the
shared memory path. Valgrind still reported an uninitialized-memory error
in the same spot, which I believe is due to the struct being padded for
alignment. I added a suppression for it and was able to get past this
part just fine.
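
For anyone else who runs into this, the two pieces were an MCA parameter to
turn off the shared memory BTL and a valgrind suppression file. The sketch
below only shows the shape of both: the error kind and frame are taken from
the uninitialised-value report quoted further down and will differ per
build, and the process count, application name, and arguments are
placeholders:

=============================================
# disable the shared memory BTL and run under valgrind;
# --gen-suppressions makes valgrind print a ready-to-paste
# suppression block for each error it reports
mpirun -np 2 --mca btl ^sm \
    valgrind --gen-suppressions=all --suppressions=./ompi.supp ./sus <args>

# ompi.supp
{
   uninitialised-read-in-sm-btl
   Memcheck:Value8
   fun:mca_btl_sm_component_progress
}
=============================================

(Zero-initializing the struct before the send, e.g. with
memset(&myinfo, 0, sizeof(myinfo)), would also make the padding bytes
defined and avoid the need for a suppression.)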

Thanks,
Justin

On Thu, Jul 9, 2009 at 5:16 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> On Jul 7, 2009, at 11:47 AM, Justin wrote:
>
>> (Sorry if this is posted twice; I sent the same email yesterday but it
>> never appeared on the list.)
>>
>>
> Sorry for the delay in replying. FWIW, I got your original message as
> well.
>
>> Hi, I am attempting to debug memory corruption in an MPI program
>> using valgrind. However, when I run under valgrind I get semi-random
>> segfaults and valgrind errors pointing into the Open MPI library. Here
>> is an example of such a segfault:
>>
>> ==6153==
>> ==6153== Invalid read of size 8
>> ==6153== at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/
>> mca_btl_sm.so)
>>
>> ...
>
>> ==6153== Address 0x10 is not stack'd, malloc'd or (recently) free'd
>> Thread "main" (pid 6153) caught signal SIGSEGV at address (nil)
>> (segmentation violation)
>>
>> Looking at the code for our Isend at SFC.h:298, I don't see any
>> errors:
>>
>> =============================================
>> MergeInfo<BITS> myinfo, theirinfo;
>>
>> MPI_Request srequest, rrequest;
>> MPI_Status status;
>>
>> myinfo.n = n;
>> if(n != 0)
>> {
>>   myinfo.min = sendbuf[0].bits;
>>   myinfo.max = sendbuf[n-1].bits;
>> }
>> //cout << rank << " n:" << n << " min:" << (int)myinfo.min
>> //     << "max:" << (int)myinfo.max << endl;
>>
>> MPI_Isend(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm, &srequest);
>> ==============================================
>>
>> myinfo is a struct located on the stack, to is the rank of the processor
>> the message is being sent to, and srequest is also on the stack. In
>> addition, this message is waited on prior to exiting this block of code,
>> so both still exist on the stack until the send completes. When I don't
>> run under valgrind, my program runs past this point just fine.
>>
>>
> Strange. I can't think of an immediate reason as to why this would happen
> -- does it also happen if you use a blocking send (vs. an Isend)? Is myinfo
> a complex object, or a variable-length object?
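>
> Untested sketch of what I mean, just as a comparison and not a fix: drop
> the srequest/MPI_Wait pair and send the same buffer with a plain blocking
> call, then see whether valgrind still complains inside mca_btl_sm:
>
> MPI_Send(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm);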
>
>
>> I am currently using Open MPI 1.3 from the Debian unstable branch. I
>> also see the same type of segfault in a different portion of the code,
>> involving an MPI_Allgatherv, which can be seen below:
>>
>> ==============================================
>> ==22736== Use of uninitialised value of size 8
>> ==22736== at 0x19104775: mca_btl_sm_component_progress
>> (opal_list.h:322)
>> ==22736== by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736== by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736== by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
>> (coll_tuned_util.c:55)
>> ==22736== by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
>> (coll_tuned_util.h:60)
>> ==22736== by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736== by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736== by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736== by 0x6465457:
>> Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&,
>> Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736== by 0x8345759: Uintah::SimulationController::gridSetup()
>> (SimulationController.cc:243)
>> ==22736== by 0x834F418: Uintah::AMRSimulationController::run()
>> (AMRSimulationController.cc:117)
>> ==22736== by 0x4089AE: main (sus.cc:629)
>> ==22736==
>> ==22736== Invalid read of size 8
>> ==22736== at 0x19104775: mca_btl_sm_component_progress
>> (opal_list.h:322)
>> ==22736== by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736== by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736== by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
>> (coll_tuned_util.c:55)
>> ==22736== by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
>> (coll_tuned_util.h:60)
>> ==22736== by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736== by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736== by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736== by 0x6465457:
>> Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&,
>> Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736== by 0x8345759: Uintah::SimulationController::gridSetup()
>> (SimulationController.cc:243)
>> ==22736== by 0x834F418: Uintah::AMRSimulationController::run()
>> (AMRSimulationController.cc:117)
>> ==22736== by 0x4089AE: main (sus.cc:629)
>> ================================================================
>>
>> Are these problems with Open MPI, and are there any known workarounds?
>>
>>
>
> These are new to me. The problem does seem to occur with OMPI's shared
> memory device; you might want to try a different point-to-point device
> (e.g., tcp?) to see if the problem goes away. But be aware that the problem
> "going away" does not really pinpoint the location of the problem -- moving
> to a slower transport (like tcp) may simply change timing such that the
> problem does not occur. I.e., the problem could still exist in either your
> code or OMPI -- this would simply be a workaround.
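>
> For example (standard MCA parameter syntax; the process count and
> application are placeholders), this runs over the tcp and self BTLs only,
> leaving sm out entirely:
>
>   mpirun -np 2 --mca btl tcp,self valgrind ./sus <args>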
>
> --
> Jeff Squyres
> Cisco Systems
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>