Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-06-04 16:47:57


Urrrrrggg...unfortunately, the people who know the most about that code are all at the MPI Forum this week, so we may not be able to fully address it until their return. It looks like you are still going down into that malloc interceptor, so I'm not correctly blocking it for you.
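
Because the failure is intermittent (roughly half of all runs, per the original report quoted below), a single run with or without the exclusion doesn't say much. A minimal sketch for comparing failure counts over repeated runs follows. It assumes mpirun exits with a nonzero status when a rank dies on a signal; count_failures is a hypothetical helper written for this note, not part of Open MPI.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times, discards its output, and prints
# how many of the runs exited with a nonzero status.
count_failures() {
    runs="$1"; shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Example use with the two command lines from this thread (left commented
# out so the sketch also runs on a machine without Open MPI):
# count_failures 20 mpirun -np 2 --mca btl openib,self ring_c
# count_failures 20 mpirun -np 2 --mca btl openib,self -mca memory ^linux ring_c
```

The run count of 20 is arbitrary; anything large enough to separate "about half the time" from "never" would do.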

This run segfaulted in a completely different call in a different part of the startup procedure - but in the same part of the interceptor, which makes me suspicious. Don't know how much testing we've seen on SLES...
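
One way to see what the interceptor is being handed is to convert the decimal "bytes" values in the backtrace below to hex. This is only a sketch for reading the trace: to_hex is a hypothetical helper, and the arguments gdb prints for a frame aren't always trustworthy, so treat the result as a hint rather than a diagnosis.

```shell
#!/bin/sh
# to_hex DECIMAL
# Hypothetical helper: prints a decimal integer as 0x-prefixed hexadecimal.
# Relies on the shell's printf handling 64-bit integers.
to_hex() {
    printf '0x%x\n' "$1"
}

to_hex 47840385564856   # "bytes" in frame #0; prints 0x2b82b53000b8
to_hex 47840385564704   # "bytes"/"sz"/"size" in frames #1-#3; prints 0x2b82b5300020
```

The results equal the av=0x2b82b5300020 and caller=0x2b82b53000b8 values shown in the same trace, so the requested "sizes" look like addresses rather than plausible allocation sizes.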

On Jun 4, 2014, at 1:18 PM, Fischer, Greg A. <fischega_at_[hidden]> wrote:

> Ralph,
>
> It segfaults. Here's the backtrace:
>
> Core was generated by `ring_c'.
> Program terminated with signal 11, Segmentation fault.
> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> 4098 bck->fd = unsorted_chunks(av);
> (gdb) bt
> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> #1 0x00002b82b1a47e38 in opal_memory_ptmalloc2_malloc (bytes=47840385564704) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
> #2 0x00002b82b1a47b36 in opal_memory_linux_malloc_hook (sz=47840385564704, caller=0x2b82b53000b8) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
> #3 0x00002b82b19e7b18 in opal_malloc (size=47840385564704, file=0x2b82b53000b8 "", line=12) at ../../../openmpi-1.8.1/opal/util/malloc.c:101
> #4 0x00002b82b199c017 in opal_hash_table_set_value_uint64 (ht=0x2b82b5300020, key=47840385564856, value=0xc) at ../../openmpi-1.8.1/opal/class/opal_hash_table.c:283
> #5 0x00002b82b170e4ca in process_uri (uri=0x2b82b5300020 "\001") at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:348
> #6 0x00002b82b170e941 in orte_oob_base_set_addr (fd=-1255145440, args=184, cbdata=0xc) at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:296
> #7 0x00002b82b19fba1c in event_process_active_single_queue (base=0x655480, activeq=0x654920) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1367
> #8 0x00002b82b19fbcd9 in event_process_active (base=0x655480) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1437
> #9 0x00002b82b19fc4c3 in opal_libevent2021_event_base_loop (base=0x655480, flags=1) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1645
> #10 0x00002b82b16f8763 in orte_progress_thread_engine (obj=0x2b82b5300020) at ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:456
> #11 0x00002b82b0f1c7b6 in start_thread () from /lib64/libpthread.so.0
> #12 0x00002b82b1410d6d in clone () from /lib64/libc.so.6
> #13 0x0000000000000000 in ?? ()
>
> Greg
>
> -----Original Message-----
> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
> Sent: Wednesday, June 04, 2014 3:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>
> Sorry for delay - digging my way out of the backlog. This is very strange as you are failing in a simple asprintf call. We check that all the players are non-NULL, and it appears that you are failing to allocate the memory for the resulting (rather short) string.
>
> I'm wondering if this is some strange interaction between SLES, the Intel compiler, and our malloc interceptor - or if there is some difference between the malloc libraries on the two machines. Let's try running it without the malloc interceptor and see if that helps.
>
> Try running with "-mca memory ^linux" on your cmd line
>
>
> On Jun 4, 2014, at 9:58 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> He isn't getting that far - he's failing in MPI_Init when the RTE
>> attempts to connect to the local daemon
>>
>>
>> On Jun 4, 2014, at 9:53 AM, Gus Correa <gus_at_[hidden]> wrote:
>>
>>> Hi Greg
>>>
>>> From your original email:
>>>
>>>>> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
>>>
>>> This may not fix the problem,
>>> but have you tried to add the shared memory btl to your mca parameter?
>>>
>>> mpirun -np 2 --mca btl openib,sm,self ring_c
>>>
>>> As far as I know, sm is the preferred transport layer for intra-node
>>> communication.
>>>
>>> Gus Correa
>>>
>>>
>>> On 06/04/2014 11:13 AM, Ralph Castain wrote:
>>>> Thanks!! Really appreciate your help - I'll try to figure out what
>>>> went wrong and get back to you
>>>>
>>>> On Jun 4, 2014, at 8:07 AM, Fischer, Greg A.
>>>> <fischega_at_[hidden] <mailto:fischega_at_[hidden]>> wrote:
>>>>
>>>>> I re-ran with 1 processor and got more information. How about this?
>>>>> Core was generated by `ring_c'.
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020,
>>>>> bytes=47592367980728) at
>>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>>> 4098 bck->fd = unsorted_chunks(av);
>>>>> (gdb) bt
>>>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020,
>>>>> bytes=47592367980728) at
>>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>>> #1 0x00002b48f2a15e38 in opal_memory_ptmalloc2_malloc
>>>>> (bytes=47592367980576) at
>>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
>>>>> #2 0x00002b48f2a15b36 in opal_memory_linux_malloc_hook
>>>>> (sz=47592367980576, caller=0x2b48f63000b8) at
>>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
>>>>> #3 0x00002b48f2374b90 in vasprintf () from /lib64/libc.so.6
>>>>> #4 0x00002b48f2354148 in asprintf () from /lib64/libc.so.6
>>>>> #5 0x00002b48f26dc7d1 in orte_oob_base_get_addr
>>>>> (uri=0x2b48f6300020) at
>>>>> ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:234
>>>>> #6 0x00002b48f53e7d4a in orte_rml_oob_get_uri () at
>>>>> ../../../../../openmpi-1.8.1/orte/mca/rml/oob/rml_oob_contact.c:36
>>>>> #7 0x00002b48f26fa181 in orte_routed_base_register_sync (setup=32 ' ')
>>>>> at ../../../../openmpi-1.8.1/orte/mca/routed/base/routed_base_fns.c:301
>>>>> #8 0x00002b48f4bbcccf in init_routes (job=4130340896,
>>>>> ndat=0x2b48f63000b8) at
>>>>> ../../../../../openmpi-1.8.1/orte/mca/routed/binomial/routed_binomial.c:705
>>>>> #9 0x00002b48f26c615d in orte_ess_base_app_setup
>>>>> (db_restrict_local=32 ' ') at
>>>>> ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:245
>>>>> #10 0x00002b48f45b069f in rte_init () at
>>>>> ../../../../../openmpi-1.8.1/orte/mca/ess/env/ess_env_module.c:146
>>>>> #11 0x00002b48f26935ab in orte_init (pargc=0x2b48f6300020,
>>>>> pargv=0x2b48f63000b8, flags=8) at
>>>>> ../../openmpi-1.8.1/orte/runtime/orte_init.c:148
>>>>> #12 0x00002b48f1739d38 in ompi_mpi_init (argc=1,
>>>>> argv=0x7fffebf0d1f8, requested=8, provided=0x0) at
>>>>> ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:464
>>>>> #13 0x00002b48f1760a37 in PMPI_Init (argc=0x2b48f6300020,
>>>>> argv=0x2b48f63000b8) at pinit.c:84
>>>>> #14 0x00000000004024ef in main (argc=1, argv=0x7fffebf0d1f8) at
>>>>> ring_c.c:19
>>>>>
>>>>> *From:* users [mailto:users-bounces_at_[hidden]] *On Behalf Of* Ralph Castain
>>>>> *Sent:* Wednesday, June 04, 2014 11:00 AM
>>>>> *To:* Open MPI Users
>>>>> *Subject:* Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>>>>>
>>>>> Does the trace go any further back? Your prior trace seemed to indicate an
>>>>> error in our OOB framework, but in a very basic place. Looks like it could
>>>>> be an uninitialized variable, and having the line number down as deep as
>>>>> possible might help identify the source.
>>>>>
>>>>> On Jun 4, 2014, at 7:55 AM, Fischer, Greg A.
>>>>> <fischega_at_[hidden] <mailto:fischega_at_[hidden]>> wrote:
>>>>>
>>>>>
>>>>> Oops, ulimit was set improperly. I generated a core file, loaded it
>>>>> in GDB, and ran a backtrace:
>>>>> Core was generated by `ring_c'.
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020,
>>>>> bytes=47890224382136) at
>>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>>> 4098 bck->fd = unsorted_chunks(av);
>>>>> (gdb) bt
>>>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020,
>>>>> bytes=47890224382136) at
>>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>>> #1 0x0000000000000000 in ?? ()
>>>>> Is that helpful?
>>>>> Greg
>>>>>
>>>>> *From:* Fischer, Greg A.
>>>>> *Sent:* Wednesday, June 04, 2014 10:17 AM
>>>>> *To:* 'Open MPI Users'
>>>>> *Cc:* Fischer, Greg A.
>>>>> *Subject:* RE: [OMPI users] intermittent segfaults with openib on ring_c.c
>>>>>
>>>>> I recompiled with "-enable-debug" but it doesn't seem to be providing any
>>>>> more information or a core dump. I'm compiling ring_c.c with:
>>>>>
>>>>> mpicc ring_c.c -g -traceback -o ring_c
>>>>>
>>>>> and running with:
>>>>>
>>>>> mpirun -np 4 --mca btl openib,self ring_c
>>>>>
>>>>> and I'm getting:
>>>>>
>>>>> [binf112:05845] *** Process received signal ***
>>>>> [binf112:05845] Signal: Segmentation fault (11)
>>>>> [binf112:05845] Signal code: Address not mapped (1)
>>>>> [binf112:05845] Failing at address: 0x10
>>>>> [binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
>>>>> [binf112:05845] [ 1] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
>>>>> [binf112:05845] [ 2] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
>>>>> [binf112:05845] [ 3] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
>>>>> [binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
>>>>> [binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
>>>>> [binf112:05845] [ 6] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
>>>>> [binf112:05845] [ 7] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
>>>>> [binf112:05845] [ 8] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
>>>>> [binf112:05845] [ 9] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
>>>>> [binf112:05845] [10] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
>>>>> [binf112:05845] [11] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
>>>>> [binf112:05845] [12] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
>>>>> [binf112:05845] [13] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
>>>>> [binf112:05845] [14] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
>>>>> [binf112:05845] [15] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
>>>>> [binf112:05845] [16] ring_c[0x4024ef]
>>>>> [binf112:05845] [17] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
>>>>> [binf112:05845] [18] ring_c[0x4023f9]
>>>>> [binf112:05845] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 3 with PID 5845 on node xxxx112 exited on
>>>>> signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
>>>>> Does any of that help?
>>>>> Greg
>>>>>
>>>>> *From:* users [mailto:users-bounces_at_[hidden]] *On Behalf Of* Ralph Castain
>>>>> *Sent:* Tuesday, June 03, 2014 11:54 PM
>>>>> *To:* Open MPI Users
>>>>> *Subject:* Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>>>>>
>>>>> Sounds odd - can you configure OMPI --enable-debug and run it again? If it
>>>>> fails and you can get a core dump, could you tell us the line number where
>>>>> it is failing?
>>>>>
>>>>> On Jun 3, 2014, at 9:58 AM, Fischer, Greg A.
>>>>> <fischega_at_[hidden] <mailto:fischega_at_[hidden]>> wrote:
>>>>>
>>>>> Apologies - I forgot to add some of the information requested by the FAQ:
>>>>>
>>>>> 1. OpenFabrics is provided by the Linux distribution:
>>>>>
>>>>> [binf102:fischega] $ rpm -qa | grep ofed
>>>>> ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
>>>>> ofed-1.5.4.1-0.11.5
>>>>> ofed-doc-1.5.4.1-0.11.5
>>>>>
>>>>>
>>>>> 2. Linux Distro / Kernel:
>>>>>
>>>>> [binf102:fischega] $ cat /etc/SuSE-release
>>>>> SUSE Linux Enterprise Server 11 (x86_64)
>>>>> VERSION = 11
>>>>> PATCHLEVEL = 3
>>>>>
>>>>> [binf102:fischega] $ uname -a
>>>>> Linux xxxx102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux
>>>>>
>>>>>
>>>>> 3. Not sure which subnet manager is being used - I think OpenSM, but
>>>>> I'll need to check with my administrators.
>>>>>
>>>>>
>>>>> 4. Output of ibv_devinfo is attached.
>>>>>
>>>>>
>>>>> 5. Ifconfig output is attached.
>>>>>
>>>>>
>>>>> 6. Ulimit -l output:
>>>>>
>>>>> [binf102:fischega] $ ulimit -l
>>>>> unlimited
>>>>>
>>>>> Greg
>>>>>
>>>>>
>>>>> *From:* Fischer, Greg A.
>>>>> *Sent:* Tuesday, June 03, 2014 12:38 PM
>>>>> *To:* Open MPI Users
>>>>> *Cc:* Fischer, Greg A.
>>>>> *Subject:* intermittent segfaults with openib on ring_c.c
>>>>>
>>>>> Hello openmpi-users,
>>>>>
>>>>> I'm running into a perplexing problem on a new system, whereby I'm
>>>>> experiencing intermittent segmentation faults when I run the ring_c.c
>>>>> example and use the openib BTL. See an example below. Approximately 50% of
>>>>> the time it provides the expected output, but the other 50% of the time, it
>>>>> segfaults.
>>>>>
>>>>> LD_LIBRARY_PATH is set correctly, and the version of "mpirun" being invoked
>>>>> is correct. The output of ompi_info -all is attached.
>>>>>
>>>>> One potential problem may be that the system that OpenMPI was compiled on
>>>>> is /mostly/ the same as the system where it is being executed, but there
>>>>> are some differences in the installed packages. I've checked the critical
>>>>> ones (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be
>>>>> the same.
>>>>>
>>>>> Can anyone suggest how I might start tracking this problem down?
>>>>>
>>>>> Thanks,
>>>>> Greg
>>>>>
>>>>> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
>>>>> [binf102:31268] *** Process received signal ***
>>>>> [binf102:31268] Signal: Segmentation fault (11)
>>>>> [binf102:31268] Signal code: Address not mapped (1)
>>>>> [binf102:31268] Failing at address: 0x10
>>>>> [binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
>>>>> [binf102:31268] [ 1] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2b42203fd7e3]
>>>>> [binf102:31268] [ 2] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b) [0x2b4220400d3b]
>>>>> [binf102:31268] [ 3] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f) [0x2b42204008ef]
>>>>> [binf102:31268] [ 4] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) [0x2b4220400876]
>>>>> [binf102:31268] [ 5] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c) [0x2b422572334c]
>>>>> [binf102:31268] [ 6] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa) [0x2b422041d64a]
>>>>> [binf102:31268] [ 7] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f) [0x2b422573612f]
>>>>> [binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
>>>>> [binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
>>>>> [binf102:31268] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 31268 on node xxxx102 exited on
>>>>> signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
>>>>> <ibv_devinfo.txt><ifconfig.txt>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users