Open MPI Development Mailing List Archives

Subject: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures
From: George Bosilca (bosilca_at_[hidden])
Date: 2014-01-26 10:10:29

I noticed two major issues on 32-bit machines. The first is with the vader BTL and the second is with the selection logic in basesmuma (bcol_basesmuma_bank_init_opti). Both trunk and 1.7 are affected.

If I turn off vader and bcol via MCA parameters, everything runs just fine.
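
For reference, a minimal sketch of one way to apply that workaround, using Open MPI's OMPI_MCA_<param> environment variables and the "^" exclusion syntax. Excluding only the basesmuma component is an assumption about what turning off bcol means here, and the application name in the comment is a placeholder.

```shell
# Exclude the vader BTL; a leading ^ selects every component except those listed.
export OMPI_MCA_btl='^vader'

# Exclude the basesmuma bcol component (assumption: this is the part of
# bcol that needs to be turned off).
export OMPI_MCA_bcol='^basesmuma'

# Equivalent command-line form (./your_app is a placeholder):
#   mpirun --mca btl ^vader --mca bcol ^basesmuma -np 2 ./your_app
```

Environment variables set this way are picked up by any mpirun started from the same shell, which is convenient when the test driver builds its own command line.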


../trunk/configure --enable-debug --disable-mpi-cxx --disable-mpi-fortran --disable-io-romio --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default

- Vader generates a segfault for any application, even with only 2 processes, so this should be pretty easy to track down.

Program received signal SIGSEGV, Segmentation fault.
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x00ae43b3 in mca_btl_vader_poll_fifo ()
    at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
#2 0x00ae444a in mca_btl_vader_component_progress ()
    at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
#3 0x008fdb95 in opal_progress ()
    at ../../trunk/opal/runtime/opal_progress.c:186
#4 0x001961bc in ompi_request_default_test_some (count=13,
    requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60,
    statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
#5 0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48,
    outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
    at ptestsome.c:81

- basesmuma overwrites memory. The results_array can’t be released because the heap is corrupted. I did not have time to investigate much, but it looks like pload_mgmt->data_bffs is either too small or data is somehow being stored outside its boundaries.

*** glibc detected *** /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf: free(): invalid next size (fast): 0x081f0798 ***

(gdb) bt
#0 0x00130424 in __kernel_vsyscall ()
#1 0x006bfb11 in raise () from /lib/
#2 0x006c13ea in abort () from /lib/
#3 0x006ff9d5 in __libc_message () from /lib/
#4 0x00705e31 in malloc_printerr () from /lib/
#5 0x00708571 in _int_free () from /lib/
#6 0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60,
    bcol_module=0xb30b3008, reg_data=0x81e6698)
    at ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
#7 0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
#8 0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
#9 0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
#10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, priority=0xbfffe558)
    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
#11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0,
    priority=0xbfffe558, module=0xbfffe580)
    at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
#12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0,
    priority=0xbfffe558, module=0xbfffe580)
    at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
#13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500,
    at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
#14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
    at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:284
#15 0x001fbbe1 in mca_coll_base_comm_select (comm=0x8127da0)
    at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:117
#16 0x0019872f in ompi_mpi_init (argc=7, argv=0xbfffee74, requested=0,
    provided=0xbfffe970) at ../../trunk/ompi/runtime/ompi_mpi_init.c:894
#17 0x001c9509 in PMPI_Init (argc=0xbfffe9c0, argv=0xbfffe9c4) at pinit.c:84