
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2014-01-27 16:02:43


Nathan,

To encourage you to focus on 1.7.4, I will delay testing vader on the SGI
UV until I've tested the next 1.7.4 release candidate (or final).

-Paul

On Mon, Jan 27, 2014 at 12:55 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Just FWIW: I believe that problem did indeed make it over to 1.7.4, and
> that release is on "hold" pending your fix. So while I'm happy to hear
> about xpmem on SGI, please do let us release 1.7.4!
>
>
> On Jan 27, 2014, at 8:19 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>
> > Yup. Has to do with not having 64-bit atomic math. The fix is complete
> > but I am working on another update to enable using xpmem on SGI
> > systems. I will push the changes once that is complete.
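> >
> > To illustrate the class of problem for anyone following along: without
> > native 64-bit atomics, a 64-bit update on a 32-bit target either tears
> > or needs a fallback. A minimal sketch of a lock-based fallback, purely
> > illustrative and not the actual fix (a real shared-memory FIFO would
> > need something stronger than a process-local lock):
> >
> >   #include <stdint.h>
> >   #include <pthread.h>
> >
> >   static pthread_mutex_t atomic64_lock = PTHREAD_MUTEX_INITIALIZER;
> >
> >   /* Hypothetical fallback: emulate a 64-bit atomic fetch-and-add with
> >    * a process-local lock when the CPU has no 64-bit compare-and-swap. */
> >   static uint64_t emulated_fetch_add_64(volatile uint64_t *addr,
> >                                         uint64_t delta)
> >   {
> >       pthread_mutex_lock(&atomic64_lock);
> >       uint64_t old = *addr;
> >       *addr = old + delta;
> >       pthread_mutex_unlock(&atomic64_lock);
> >       return old;
> >   }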
> >
> > -Nathan
> >
> > On Mon, Jan 27, 2014 at 04:00:08PM +0000, Jeff Squyres (jsquyres) wrote:
> >> Is this the same issue Absoft is seeing in 32-bit builds on the trunk?
> >> (i.e., 100% failure rate)
> >>
> >> http://mtt.open-mpi.org/index.php?do_redir=2142
> >>
> >>
> >> On Jan 27, 2014, at 10:38 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
> >>
> >>> This shouldn't be affecting 1.7.4 since neither the vader nor the
> >>> coll/ml updates have been moved over yet. As for the trunk, I am
> >>> working on a 32-bit fix for vader and it should be in later today. I
> >>> will have to track down what is going wrong in the basesmuma
> >>> initialization.
> >>>
> >>> -Nathan
> >>>
> >>> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
> >>>> I noticed two major issues on 32-bit machines. The first one is with
> >>>> the vader BTL and the second with the selection logic in basesmuma
> >>>> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 1.7.
> >>>>
> >>>> If I turn off vader and bcol via the MCA parameters everything runs
> >>>> just fine.
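> >>>>
> >>>> For reference, the kind of invocation I mean (test binary name here
> >>>> is just a placeholder; exclusion uses the usual "^" MCA syntax):
> >>>>
> >>>>   mpirun --mca btl ^vader --mca bcol ^basesmuma -np 2 ./my_test
> >>>>
> >>>> Excluding coll/ml instead (--mca coll ^ml) should also keep basesmuma
> >>>> out of the picture, since coll/ml is what drives the bcol components.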
> >>>>
> >>>> George.
> >>>>
> >>>> ../trunk/configure --enable-debug --disable-mpi-cxx
> >>>> --disable-mpi-fortran --disable-io-romio
> >>>> --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
> >>>>
> >>>>
> >>>> - Vader generates a segfault for any application even with only 2
> >>>> processes, so this should be pretty easy to track.
> >>>>
> >>>> Program received signal SIGSEGV, Segmentation fault.
> >>>> (gdb) bt
> >>>> #0 0x00000000 in ?? ()
> >>>> #1 0x00ae43b3 in mca_btl_vader_poll_fifo ()
> >>>> at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
> >>>> #2 0x00ae444a in mca_btl_vader_component_progress ()
> >>>> at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
> >>>> #3 0x008fdb95 in opal_progress ()
> >>>> at ../../trunk/opal/runtime/opal_progress.c:186
> >>>> #4 0x001961bc in ompi_request_default_test_some (count=13,
> >>>> requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60,
> >>>> statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
> >>>> #5 0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48,
> >>>> outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
> >>>> at ptestsome.c:81
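> >>>>
> >>>> Frame #0 at 0x00000000 means execution jumped through a bad (here
> >>>> NULL) pointer from inside mca_btl_vader_poll_fifo. One way that can
> >>>> happen on a 32-bit target (illustrative only, not the actual vader
> >>>> code) is a torn, non-atomic read of a 64-bit slot:
> >>>>
> >>>>   #include <stdint.h>
> >>>>
> >>>>   /* On ILP32 without 64-bit atomics this load is done as two 32-bit
> >>>>    * loads; a concurrent writer can leave the reader with one old
> >>>>    * half and one new half, e.g. a NULL-looking pointer. */
> >>>>   void jump_through_slot(volatile uint64_t *slot)
> >>>>   {
> >>>>       uint64_t v = *slot;                       /* may tear        */
> >>>>       void (*fn)(void) = (void (*)(void))(uintptr_t)v;
> >>>>       fn();                                     /* can land at 0x0 */
> >>>>   }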
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> - basesmuma overwrites memory. The results_array can’t be released
> >>>> as the memory is corrupted. I did not have time to investigate too
> >>>> much but it looks like pload_mgmt->data_buffs is either too small or
> >>>> somehow data is stored outside its boundaries.
> >>>>
> >>>> *** glibc detected ***
> >>>> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
> >>>> free(): invalid next size (fast): 0x081f0798 ***
> >>>>
> >>>> (gdb) bt
> >>>> #0 0x00130424 in __kernel_vsyscall ()
> >>>> #1 0x006bfb11 in raise () from /lib/libc.so.6
> >>>> #2 0x006c13ea in abort () from /lib/libc.so.6
> >>>> #3 0x006ff9d5 in __libc_message () from /lib/libc.so.6
> >>>> #4 0x00705e31 in malloc_printerr () from /lib/libc.so.6
> >>>> #5 0x00708571 in _int_free () from /lib/libc.so.6
> >>>> #6 0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60,
> >>>> bcol_module=0xb30b3008, reg_data=0x81e6698)
> >>>> at ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
> >>>> #7 0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
> >>>> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
> >>>> #8 0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
> >>>> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
> >>>> #9 0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
> >>>> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
> >>>> #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, priority=0xbfffe558)
> >>>> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
> >>>> #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0,
> >>>> priority=0xbfffe558, module=0xbfffe580)
> >>>> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
> >>>> #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0,
> >>>> priority=0xbfffe558, module=0xbfffe580)
> >>>> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
> >>>> #13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500,
> >>>> module=0xbfffe580)
> >>>> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
> >>>> #14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
> >>>> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:284
> >>>> #15 0x001fbbe1 in mca_coll_base_comm_select (comm=0x8127da0)
> >>>> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:117
> >>>> #16 0x0019872f in ompi_mpi_init (argc=7, argv=0xbfffee74, requested=0,
> >>>> provided=0xbfffe970) at ../../trunk/ompi/runtime/ompi_mpi_init.c:894
> >>>> #17 0x001c9509 in PMPI_Init (argc=0xbfffe9c0, argv=0xbfffe9c4) at pinit.c:84
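> >>>>
> >>>> For what it's worth, "free(): invalid next size" is glibc noticing
> >>>> that the size field of the *next* heap chunk was clobbered, i.e. a
> >>>> write past the end of a malloc'd block, which matches the
> >>>> too-small-buffer theory above. A tiny reproducer of this error class
> >>>> (illustrative only):
> >>>>
> >>>>   #include <stdlib.h>
> >>>>   #include <string.h>
> >>>>
> >>>>   int main(void)
> >>>>   {
> >>>>       char *buf = malloc(16);
> >>>>       memset(buf, 0, 32);  /* 16 bytes past the end: corrupts the
> >>>>                             * allocator metadata of the next chunk */
> >>>>       free(buf);           /* glibc aborts here, as in the trace   */
> >>>>       return 0;
> >>>>   }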
> >>>>
> >>>>
> >>>>
> >>
> >>
> >> --
> >> Jeff Squyres
> >> jsquyres_at_[hidden]
> >> For corporate legal information go to:
> >> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
>
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900