Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc2r30148 - crash in MPI_Init on Linux/x86
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-01-09 10:26:24


Shoot. My bad there. Looks like the enumerator sentinel is missing. Will fix now.

-Nathan
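
For readers following the fix: the strdup()/strlen() frames in the backtrace below imply that mca_base_var_enum_create() walks the values table until it reaches an entry whose string member is NULL, so a table without that terminator sends the loop into whatever memory follows. A minimal sketch of the pattern in C (the struct layout is inferred from the (name, values, enumerator) signature visible in the backtrace; the specific enum entries are hypothetical, not copied from the coll/ml source):

    #include <stddef.h>

    /* Assumed layout of mca_base_var_enum_value_t. */
    typedef struct {
        int value;
        const char *string;
    } enum_value_t;

    /* Correct: the terminating NULL-string entry tells
     * mca_base_var_enum_create() where the table ends. */
    static enum_value_t frag_enum_ok[] = {
        {0, "disable"},   /* hypothetical entries */
        {1, "enable"},
        {0, NULL}         /* sentinel */
    };

    /* Buggy: with no sentinel, the create routine's loop reads past
     * the array and eventually hands non-string bytes to strdup(),
     * which faults inside strlen(), matching frames #0/#1 in the
     * gdb backtrace below. */
    static enum_value_t frag_enum_bad[] = {
        {0, "disable"},
        {1, "enable"}
    };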

On Wed, Jan 08, 2014 at 09:27:46PM -0800, Paul Hargrove wrote:
> Ralph,
> When rebuilding with --enable-debug and the original gcc-4.0.0 the SEGV
> returns.
> So, the ompi-1.4 in the LD_LIBRARY_PATH was NOT the cause.
> Below is a backtrace from gdb which includes line numbers.
> The SEGV is in strlen(), which suggests a string that lacks
> null-termination.
> The initial (siginfo) part of the backtrace provided by Open MPI reads:
> [pcp-j-6:02741] *** Process received signal ***
> [pcp-j-6:02741] Signal: Segmentation fault (11)
> [pcp-j-6:02741] Signal code: Address not mapped (1)
> [pcp-j-6:02741] Failing at address: 0x63757274
> -Paul
> #0  0x00a5dbb3 in strlen () from /lib/libc.so.6
> #1  0x00a5d8f5 in strdup () from /lib/libc.so.6
> #2  0x00534a3b in mca_base_var_enum_create (name=0x349488 "coll_ml_enable_fragmentation_enum", values=0x34e014, enumerator=0xbfe03dd0)
>     at /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/openmpi-1.7.4rc2r30168/opal/mca/base/mca_base_var_enum.c:133
> #3  0x0033c328 in mca_coll_ml_register_params ()
>     at /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/openmpi-1.7.4rc2r30168/ompi/mca/coll/ml/coll_ml_mca.c:257
> #4  0x00537585 in register_components (project_name=0x2056f3 "ompi", type_name=0x2056f8 "coll", output_id=-1, src=0xbfe03e7c, dest=0x21bd10)
>     at /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/openmpi-1.7.4rc2r30168/opal/mca/base/mca_base_components_register.c:116
> #5  0x0053736a in mca_base_framework_components_register (framework=0x21bce0, flags=MCA_BASE_REGISTER_DEFAULT)
>     at /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/openmpi-1.7.4rc2r30168/opal/mca/base/mca_base_components_register.c:67
> #6  0x00537ec1 in mca_base_framework_register (framework=0x21bce0, flags=MCA_BASE_REGISTER_DEFAULT)
>     at /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/openmpi-1.7.4rc2r30168/opal/mca/base/mca_base_framework.c:107
> #7  0x00537f6f in mca_base_framework_open (framework=0x21bce0, flags=MCA_BASE_OPEN_DEFAULT)
>     at /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/openmpi-1.7.4rc2r30168/opal/mca/base/mca_base_framework.c:131
> #8  0x00152831 in ompi_mpi_init (argc=1, argv=0xbfe04114, requested=0, provided=0xbfe0400c)
>     at /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/openmpi-1.7.4rc2r30168/ompi/runtime/ompi_mpi_init.c:555
> #9  0x00186ce1 in PMPI_Init (argc=0xbfe04090, argv=0xbfe04094) at pinit.c:84
> #10 0x080486e9 in main (argc=1, argv=0xbfe04114) at ring_c.c:19
>
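A side note on the evidence: read as little-endian bytes, the faulting addresses are ASCII text (0x63757274 is "truc", and 0x6c6c6f63 from the first report is "coll"), which is consistent with string data being dereferenced as if it were a pointer. A deliberately buggy standalone illustration of that failure mode (not OMPI code):

    /* Demo of the suspected failure mode: a table that should be
     * NULL-terminated but is not, walked the way an enumerator
     * constructor would walk it. */
    #include <stdlib.h>
    #include <string.h>

    struct entry { int value; const char *string; };

    int main(void) {
        struct entry table[] = {
            {0, "disable"},
            {1, "enable"}
            /* missing {0, NULL} sentinel */
        };

        /* The loop condition reads past the array; whatever bytes
         * follow are treated as a char pointer.  If they come from
         * string data, strdup()/strlen() faults at an ASCII-looking
         * address, as in the traces in this thread. */
        for (size_t i = 0; table[i].string != NULL; ++i) {
            char *copy = strdup(table[i].string);
            free(copy);
        }
        return 0;
    }
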
> On Wed, Jan 8, 2014 at 8:45 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> Only takes <30 seconds of typing to start the test and I get email when
> it is done.
> Typing these emails takes more of my time than the actual testing does.
> -Paul
>
> On Wed, Jan 8, 2014 at 8:35 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> If you have the time, it might be worth nailing it down. However, I'm
> mindful of all the things you need to do, so please only if you have
> the time.
> Thanks
> Ralph
> On Jan 8, 2014, at 8:23 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> Ralph,
>
> Building with gcc-4.1.2 fixed the problem for me. I also removed an
> old install of ompi-1.4 that was in LD_LIBRARY_PATH at build time
> and might have been a contributing factor. If I'd known earlier
> that it was there, I wouldn't have reported the problem without
> first removing it.
>
> I can build again with gcc-4.0.0 and --enable-debug if you are still
> interested in trying to get a line number. This would also
> determine if LD_LIBRARY_PATH was the true culprit.
>
> -Paul [Sent from my phone]
>
> On Jan 8, 2014 8:02 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>
> Most likely problem is a bad backing store site - any chance you
> could give me a line number from this? There are a lot of calls to
> register params in that code and I'd need some help in figuring
> out which one wasn't right.
> On Jan 8, 2014, at 6:59 PM, Paul Hargrove <phhargrove_at_[hidden]>
> wrote:
>
> I am still testing the current 1.7.4rc tarball on my various
> systems. The latest failure (shown below) is a SEGV somewhere
> below MPI_Init on an old, but otherwise fairly normal, Linux/x86
> (32-bit) system.
> $ /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/bin/mpirun -np 1 examples/ring_c
> [pcp-j-6:29031] *** Process received signal ***
> [pcp-j-6:29031] Signal: Segmentation fault (11)
> [pcp-j-6:29031] Signal code: Address not mapped (1)
> [pcp-j-6:29031] Failing at address: 0x6c6c6f63
> [pcp-j-6:29031] [ 0] [0xbe4440]
> [pcp-j-6:29031] [ 1] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_var_enum_create+0x15d) [0x2b11ed]
> [pcp-j-6:29031] [ 2] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_register_params+0x639) [0x440909]
> [pcp-j-6:29031] [ 3] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_components_register+0x14e) [0x2b2cce]
> [pcp-j-6:29031] [ 4] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_register+0x1b5) [0x2b32a5]
> [pcp-j-6:29031] [ 5] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_open+0x4e) [0x2b333e]
> [pcp-j-6:29031] [ 6] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libmpi.so.1(ompi_mpi_init+0x53d) [0xaf359d]
> [pcp-j-6:29031] [ 7] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libmpi.so.1(MPI_Init+0x13d) [0xb10d6d]
> [pcp-j-6:29031] [ 8] examples/ring_c [0x80486e9]
> [pcp-j-6:29031] [ 9] /lib/libc.so.6(__libc_start_main+0xdc) [0x125ebc]
> [pcp-j-6:29031] [10] examples/ring_c [0x8048631]
> [pcp-j-6:29031] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 29031 on node
> pcp-j-6 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> The failure shown is for a singleton run, but np=2 fails as
> well.
> System info:
> $ uname -a
> Linux pcp-j-6 2.6.18-238.1.1.el5PAE #1 SMP Tue Jan 18 19:28:42 EST 2011 i686 athlon i386 GNU/Linux
> $ gcc --version
> gcc (GCC) 4.0.0
> Copyright (C) 2005 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> The only configure argument used was --prefix.
> I was going to attach output from "ompi_info --all", but it
> SEGV's too!
> $ ompi_info --all
> [pcp-j-6:29092] *** Process received signal ***
> [pcp-j-6:29092] Signal: Segmentation fault (11)
> [pcp-j-6:29092] Signal code: Address not mapped (1)
> [pcp-j-6:29092] Failing at address: 0x6c6c6f63
> [pcp-j-6:29092] [ 0] [0xd8a440]
> [pcp-j-6:29092] [ 1] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_var_enum_create+0x15d) [0x2db1ed]
> [pcp-j-6:29092] [ 2] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_register_params+0x639) [0x48d909]
> [pcp-j-6:29092] [ 3] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_components_register+0x14e) [0x2dccce]
> [pcp-j-6:29092] [ 4] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_register+0x1b5) [0x2dd2a5]
> [pcp-j-6:29092] [ 5] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(opal_info_register_project_frameworks+0x57) [0x2b83d7]
> [pcp-j-6:29092] [ 6] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libmpi.so.1(ompi_info_register_framework_params+0x81) [0xa69251]
> [pcp-j-6:29092] [ 7] ompi_info(main+0x2ba) [0x8049a2a]
> [pcp-j-6:29092] [ 8] /lib/libc.so.6(__libc_start_main+0xdc) [0x125ebc]
> [pcp-j-6:29092] [ 9] ompi_info [0x80496e1]
> [pcp-j-6:29092] *** End of error message ***
> Segmentation fault (core dumped)
> I will try again with a newer gcc and report back.
> -Paul
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


