Open MPI User's Mailing List Archives

Subject: [OMPI users] invalid write in opal_generic_simple_unpack
From: Patrik Jonsson (code_at_[hidden])
Date: 2012-03-14 12:38:52


Hi,

I'm trying to track down a spurious segmentation fault that I'm
getting with my MPI application. I tried using valgrind, and after
suppressing the 25,000 errors in PMPI_Init_thread and associated
Init/Finalize functions, I'm left with an uninitialized write in
PMPI_Isend (which I saw is not unexpected), plus this:

==11541== Thread 1:
==11541== Invalid write of size 1
==11541==    at 0x4A09C9F: _intel_fast_memcpy (mc_replace_strmem.c:650)
==11541==    by 0x5093447: opal_generic_simple_unpack (opal_datatype_unpack.c:420)
==11541==    by 0x508D642: opal_convertor_unpack (opal_convertor.c:302)
==11541==    by 0x4F8FD1A: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:217)
==11541==    by 0x4ED51BD: mca_btl_tcp_endpoint_recv_handler (btl_tcp_endpoint.c:718)
==11541==    by 0x509644F: opal_event_loop (event.c:766)
==11541==    by 0x507FA50: opal_progress (opal_progress.c:189)
==11541==    by 0x4E95AFE: ompi_request_default_test (req_test.c:88)
==11541==    by 0x4EB8077: PMPI_Test (ptest.c:61)
==11541==    by 0x78C4339: boost::mpi::request::test() (in /n/home00/pjonsson/lib/libboost_mpi.so.1.48.0)
==11541==    by 0x4B5DA3: mcrx::mpi_master<test_xfer>::process_handshakes() (mpi_master_impl.h:216)
==11541==    by 0x4B5557: mcrx::mpi_master<test_xfer>::run() (mpi_master_impl.h:541)
==11541==  Address 0x7feffb327 is just below the stack ptr.  To suppress, use: --workaround-gcc296-bugs=yes

The test in question sends a single int between the tasks. This is
done using the Boost.MPI skeleton/content mechanism, and the receive
goes into an element of a std::vector, so there's no reason it should
unpack anywhere near the stack ptr. Also, valgrind reports a write of
size 1, whereas an int should be size 4.

This looks suspicious given that the segfault usually happens in one
of the calls to PMPI_Test. If the data is somehow being unpacked to
somewhere around the stack pointer, that certainly seems like a
possible cause.

If anyone can give me some ideas for what could cause this and how to
track it down, I'd appreciate it. I'm running out of ideas here.

Regards,

/Patrik J.