Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc2r30148 - crash in MPI_Init on Linux/x86
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-08 23:35:03


If you have the time, it might be worth nailing it down. However, I'm mindful of all the things you need to do, so please only if you have the time.

Thanks
Ralph

On Jan 8, 2014, at 8:23 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> Ralph,
>
> Building with gcc-4.1.2 fixed the problem for me. I also removed an old install of ompi-1.4 that was in LD_LIBRARY_PATH at build time and might have been a contributing factor. If I'd known earlier that it was there, I wouldn't have reported the problem without first removing it.
>
> I can build again with gcc-4.0.0 and --enable-debug if you are still interested in trying to get a line number. This would also determine if LD_LIBRARY_PATH was the true culprit.
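One way to check whether a stale install is visible through LD_LIBRARY_PATH is to scan each directory on it for Open MPI shared libraries. The sketch below is only illustrative: the helper name `find_mpi_libs` is hypothetical, and the file patterns are taken from the library names that appear in the backtraces in this thread (`libmpi.so`, `libopen-pal.so`). An older install may ship other libraries as well.

```python
import glob
import os


def find_mpi_libs(search_path, patterns=("libmpi.so*", "libopen-pal.so*")):
    """Return Open MPI-looking libraries found in a colon-separated search path.

    Hypothetical helper: it only globs for file names and does not
    reproduce the dynamic linker's full lookup rules.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # empty entries are skipped in this sketch
        for pattern in patterns:
            hits.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return hits


if __name__ == "__main__":
    # Report every candidate so an unexpected install stands out.
    for path in find_mpi_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        print(path)
```

Any hit outside the install prefix being tested would be worth removing from the environment before rebuilding.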
>
> -Paul [Sent from my phone]
>
> On Jan 8, 2014 8:02 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
> Most likely problem is a bad backing store site - any chance you could give me a line number from this? There are a lot of calls to register params in that code and I'd need some help in figuring out which one wasn't right.
>
>
> On Jan 8, 2014, at 6:59 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> I am still testing the current 1.7.4rc tarball on my various systems. The latest failure (shown below) is a SEGV somewhere below MPI_Init on an old, but otherwise fairly normal, Linux/x86 (32-bit) system.
>>
>> $ /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/bin/mpirun -np 1 examples/ring_c
>> [pcp-j-6:29031] *** Process received signal ***
>> [pcp-j-6:29031] Signal: Segmentation fault (11)
>> [pcp-j-6:29031] Signal code: Address not mapped (1)
>> [pcp-j-6:29031] Failing at address: 0x6c6c6f63
>> [pcp-j-6:29031] [ 0] [0xbe4440]
>> [pcp-j-6:29031] [ 1] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_var_enum_create+0x15d) [0x2b11ed]
>> [pcp-j-6:29031] [ 2] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_register_params+0x639) [0x440909]
>> [pcp-j-6:29031] [ 3] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_components_register+0x14e) [0x2b2cce]
>> [pcp-j-6:29031] [ 4] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_register+0x1b5) [0x2b32a5]
>> [pcp-j-6:29031] [ 5] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_open+0x4e) [0x2b333e]
>> [pcp-j-6:29031] [ 6] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libmpi.so.1(ompi_mpi_init+0x53d) [0xaf359d]
>> [pcp-j-6:29031] [ 7] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libmpi.so.1(MPI_Init+0x13d) [0xb10d6d]
>> [pcp-j-6:29031] [ 8] examples/ring_c [0x80486e9]
>> [pcp-j-6:29031] [ 9] /lib/libc.so.6(__libc_start_main+0xdc) [0x125ebc]
>> [pcp-j-6:29031] [10] examples/ring_c [0x8048631]
>> [pcp-j-6:29031] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 29031 on node pcp-j-6 exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> The failure shown is for a singleton run, but np=2 fails as well.
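Frames in a trace like the one above carry a `module(symbol+offset)` triple that a debugger can map to a source line when the binary has debug information. The sketch below pulls those triples out and formats them for gdb's `list *(symbol+offset)` form. The names `parse_frames` and `gdb_commands` are hypothetical, and the offsets are only meaningful for the exact binary that produced the trace, so a rebuild with --enable-debug would need a fresh trace.

```python
import re

# Matches frames such as: [ 1] /path/lib.so(symbol+0x15d) [0x2b11ed]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+(?P<module>\S+)"
    r"\((?P<symbol>\w+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s+\[0x[0-9a-fA-F]+\]"
)

SAMPLE = """\
[pcp-j-6:29031] [ 0] [0xbe4440]
[pcp-j-6:29031] [ 1] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_var_enum_create+0x15d) [0x2b11ed]
[pcp-j-6:29031] [ 2] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_register_params+0x639) [0x440909]
"""


def parse_frames(trace):
    """Return (index, module, symbol, offset) for each frame that names a symbol."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:  # frames without a symbol, such as [ 0], are skipped
            frames.append((
                int(match["index"]),
                match["module"],
                match["symbol"],
                int(match["offset"], 16),
            ))
    return frames


def gdb_commands(frames):
    """Format each frame as a gdb 'list' command on symbol+offset."""
    return [f"list *({symbol}+{offset:#x})" for _, _, symbol, offset in frames]


if __name__ == "__main__":
    for index, module, symbol, offset in parse_frames(SAMPLE):
        print(f"frame {index}: {module}")
    for command in gdb_commands(parse_frames(SAMPLE)):
        print(command)
```

Each command would be run in gdb with the corresponding module loaded.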
>>
>> System info:
>> $ uname -a
>> Linux pcp-j-6 2.6.18-238.1.1.el5PAE #1 SMP Tue Jan 18 19:28:42 EST 2011 i686 athlon i386 GNU/Linux
>> $ gcc --version
>> gcc (GCC) 4.0.0
>> Copyright (C) 2005 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions. There is NO
>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>>
>> The only configure argument used was --prefix.
>>
>> I was going to attach output from "ompi_info --all", but it SEGV's too!
>>
>> $ ompi_info --all
>> [pcp-j-6:29092] *** Process received signal ***
>> [pcp-j-6:29092] Signal: Segmentation fault (11)
>> [pcp-j-6:29092] Signal code: Address not mapped (1)
>> [pcp-j-6:29092] Failing at address: 0x6c6c6f63
>> [pcp-j-6:29092] [ 0] [0xd8a440]
>> [pcp-j-6:29092] [ 1] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_var_enum_create+0x15d) [0x2db1ed]
>> [pcp-j-6:29092] [ 2] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_register_params+0x639) [0x48d909]
>> [pcp-j-6:29092] [ 3] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_components_register+0x14e) [0x2dccce]
>> [pcp-j-6:29092] [ 4] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(mca_base_framework_register+0x1b5) [0x2dd2a5]
>> [pcp-j-6:29092] [ 5] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libopen-pal.so.6(opal_info_register_project_frameworks+0x57) [0x2b83d7]
>> [pcp-j-6:29092] [ 6] /home/pcp1/phargrov/OMPI/openmpi-1.7-latest-linux-x86/INST/lib/libmpi.so.1(ompi_info_register_framework_params+0x81) [0xa69251]
>> [pcp-j-6:29092] [ 7] ompi_info(main+0x2ba) [0x8049a2a]
>> [pcp-j-6:29092] [ 8] /lib/libc.so.6(__libc_start_main+0xdc) [0x125ebc]
>> [pcp-j-6:29092] [ 9] ompi_info [0x80496e1]
>> [pcp-j-6:29092] *** End of error message ***
>> Segmentation fault (core dumped)
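Both traces fail at the same address, 0x6c6c6f63, and every byte of that value is printable ASCII. Read in x86's little-endian byte order it spells "coll", which suggests (but does not prove) that string data was dereferenced as a pointer somewhere under mca_base_var_enum_create. A minimal sketch of the check, with a hypothetical helper name:

```python
def address_as_ascii(address, width=4):
    """Render a pointer's bytes, in little-endian memory order, as text.

    Non-printable bytes are shown as '.'. The default width of 4
    matches the 32-bit system in the report.
    """
    raw = address.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    failing_address = 0x6C6C6F63  # "Failing at address" in both traces
    print(address_as_ascii(failing_address))  # prints "coll"
```

A faulting address made entirely of printable characters is a common sign of a corrupted or mis-typed pointer, so it is a cheap first check before reaching for a debugger.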
>>
>> I will try again with a newer gcc and report back.
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>