Hi, I am trying to debug memory corruption in an MPI program using valgrind. However, when I run under valgrind I get semi-random segfaults and valgrind errors inside the Open MPI library. Here is an example of such a segfault:
==6153==
==6153== Invalid read of size 8
==6153== at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==6153== by 0x182ABACB: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153== by 0x182A3040: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153== by 0xB425DD3: PMPI_Isend (in /usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153== by 0x7B83DA8: int Uintah::SFC<double>::MergeExchange<unsigned char>(int, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:2989)
==6153== by 0x7B84A8F: void Uintah::SFC<double>::Batchers<unsigned char>(std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3730)
==6153== by 0x7B8857B: void Uintah::SFC<double>::Cleanup<unsigned char>(std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3695)
==6153== by 0x7B88CC6: void Uintah::SFC<double>::Parallel0<3, unsigned char>() (SFC.h:2928)
==6153== by 0x7C00AAB: void Uintah::SFC<double>::Parallel<3, unsigned char>() (SFC.h:1108)
==6153== by 0x7C0EF39: void Uintah::SFC<double>::GenerateDim<3>(int) (SFC.h:694)
==6153== by 0x7C0F0F2: Uintah::SFC<double>::GenerateCurve(int) (SFC.h:670)
==6153== by 0x7B30CAC: Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle<Uintah::Level> const&, int*) (DynamicLoadBalancer.cc:429)
==6153== Address 0x10 is not stack'd, malloc'd or (recently) free'd
^G^G^GThread "main"(pid 6153) caught signal SIGSEGV at address (nil) (segmentation violation)
The code around our MPI_Isend at SFC.h:2989 does not appear to contain any errors:
=============================================
MergeInfo<BITS> myinfo,theirinfo;
MPI_Request srequest, rrequest;
MPI_Status status;
myinfo.n=n;
if(n!=0)
{
  myinfo.min=sendbuf[0].bits;
  myinfo.max=sendbuf[n-1].bits;
}
//cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" << (int)myinfo.max << endl;
MPI_Isend(&myinfo,sizeof(MergeInfo<BITS>),MPI_BYTE,to,0,Comm,&srequest);
==============================================
myinfo is a struct located on the stack, to is the rank of the processor that the message is being sent to, and srequest is also on the stack. When I don't run with valgrind my program runs past this point just fine.
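One detail in the snippet above: when n is 0, myinfo.min and myinfo.max are never assigned, so the bytes handed to MPI_Isend are partly uninitialised. That would probably not explain an invalid read at address 0x10 inside mca_btl_sm.so, but it can add unrelated valgrind noise. A minimal sketch of filling the struct so that every byte is defined follows. The layout of MergeInfo and the helper name fill_merge_info are assumptions for illustration only; the real definition in SFC.h may differ.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical layout for illustration; the real MergeInfo in SFC.h may differ.
template <class BITS>
struct MergeInfo {
    BITS min;
    BITS max;
    unsigned int n;
};

// Hypothetical helper: fill the message in place so that all bytes,
// including padding, are defined before the struct is sent as MPI_BYTE.
template <class BITS>
void fill_merge_info(MergeInfo<BITS>& info, const BITS* bits, unsigned int n) {
    assert(bits != nullptr || n == 0);
    std::memset(&info, 0, sizeof(info));  // zero the fields and any padding
    info.n = n;
    if (n != 0) {
        info.min = bits[0];      // first element of the sorted send buffer
        info.max = bits[n - 1];  // last element of the sorted send buffer
    }
}
```

With something like this, myinfo would be filled before the existing MPI_Isend call, which itself would stay unchanged.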
I am currently using Open MPI 1.3 from the Debian unstable branch. I also see the same type of segfault in a different portion of the code, involving an MPI_Allgatherv, as shown below:
==============================================
==22736== Use of uninitialised value of size 8
==22736== at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736== by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736== by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736== by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736== by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736== by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736== by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736== by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736== by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736== by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736== by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736== by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736== at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736== by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736== by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736== by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736== by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736== by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736== by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736== by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736== by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736== by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736== by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736== by 0x4089AE: main (sus.cc:629)
================================================================
Are these problems in Open MPI, and are there any known workarounds?
Thanks,
Justin