
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
From: Gutierrez, Samuel K (samuel_at_[hidden])
Date: 2012-03-13 18:57:53


On Mar 13, 2012, at 4:07 PM, Joshua Baker-LePain wrote:

> On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote
>
>>>> Any more information about your failures in 1.5.4 is greatly appreciated.
>>>
>>> I'm happy to provide, but what exactly are you looking for? The test code I'm running is *very* simple:
>>
>> If you experience this type of failure with 1.4.5, can you send another backtrace? We'll go from there.
>

Fooey. What compiler are you using to build Open MPI and how are you configuring your build? Can you also run with a debug build of Open MPI so we can see the line numbers?
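For reference, a debug build of Open MPI that keeps symbols and line numbers is usually configured along these lines. This is a generic sketch, not the poster's actual configure invocation; the install prefix is a placeholder:

```shell
# Configure Open MPI with debugging enabled so backtraces show file:line info.
# --enable-debug turns on debug symbols plus internal sanity checking;
# -g -O0 keeps frames readable in gdb. The prefix is a placeholder.
./configure --prefix=$HOME/ompi-1.4.5-debug --enable-debug CFLAGS="-g -O0"
make -j4 && make install
```

Rebuilding the test program against the debug install (and re-running under SGE) should turn the bare offsets in mca_btl_sm.so into source lines.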

> In an odd way I'm relieved to say that 1.4.5 failed in the same way. From the SGE log of the run, here's the error message from one of the threads that segfaulted:
> [iq104:05697] *** Process received signal ***
> [iq104:05697] Signal: Segmentation fault (11)
> [iq104:05697] Signal code: Address not mapped (1)
> [iq104:05697] Failing at address: 0x2ad032188e8c
> [iq104:05697] [ 0] /lib64/libpthread.so.0() [0x3e5420f4a0]
> [iq104:05697] [ 1] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
> [iq104:05697] [ 2] /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
> [iq104:05697] [ 3] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
> [iq104:05697] [ 4] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(+0x38a24) [0x2b009628da24]
> [iq104:05697] [ 5] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
> [iq104:05697] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
> [iq104:05697] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3e53e1ecdd]
> [iq104:05697] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug() [0x400749]
> [iq104:05697] *** End of error message ***
>
> And the backtrace of the resulting core file:
> #0  0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
>    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
> #1  0x00002b00967737ca in opal_progress ()
>    from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
> #2  0x00002b00975ef8d5 in barrier ()
>    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
> #3  0x00002b009628da24 in ompi_mpi_init ()
>    from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
> #4  0x00002b00962b24f0 in PMPI_Init ()
>    from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
> #5  0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
>    at mpihello-long.c:11
>
>> Another question. How reproducible is this on your system?
>
> In my testing today, it's been 100% reproducible.

That's surprising.
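For anyone trying to reproduce this: frame #5 of the backtrace above ends in main at mpihello-long.c:11, i.e. the crash happens inside MPI_Init of a trivial test program. A minimal program of that shape looks roughly like the following (a hypothetical sketch, not the poster's actual file):

```c
/* Minimal MPI "hello" of the shape suggested by the backtrace
 * (main -> MPI_Init, crash during shared-memory BTL progress).
 * Generic sketch only; mpihello-long.c itself was not posted. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* crash site in the trace */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

Since the segfault is in the sm (shared-memory) BTL during startup, running with `mpirun --mca btl ^sm` to exclude that component is a common way to confirm the component at fault.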

Thanks,

Sam

>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users