Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-06-04 12:58:59


He isn't getting that far - he's failing in MPI_Init, when the RTE attempts to connect to the local daemon.

On Jun 4, 2014, at 9:53 AM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Greg
>
> From your original email:
>
> >> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
>
> This may not fix the problem,
> but have you tried to add the shared memory btl to your mca parameter?
>
> mpirun -np 2 --mca btl openib,sm,self ring_c
>
> As far as I know, sm is the preferred transport layer for intra-node
> communication.
>
> Gus Correa
>
>
> On 06/04/2014 11:13 AM, Ralph Castain wrote:
>> Thanks!! Really appreciate your help - I'll try to figure out what went
>> wrong and get back to you
>>
>> On Jun 4, 2014, at 8:07 AM, Fischer, Greg A. <fischega_at_[hidden]
>> <mailto:fischega_at_[hidden]>> wrote:
>>
>>> I re-ran with 1 processor and got more information. How about this?
>>> Core was generated by `ring_c'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020,
>>> bytes=47592367980728) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>> 4098 bck->fd = unsorted_chunks(av);
>>> (gdb) bt
>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020,
>>> bytes=47592367980728) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>> #1 0x00002b48f2a15e38 in opal_memory_ptmalloc2_malloc
>>> (bytes=47592367980576) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
>>> #2 0x00002b48f2a15b36 in opal_memory_linux_malloc_hook
>>> (sz=47592367980576, caller=0x2b48f63000b8) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
>>> #3 0x00002b48f2374b90 in vasprintf () from /lib64/libc.so.6
>>> #4 0x00002b48f2354148 in asprintf () from /lib64/libc.so.6
>>> #5 0x00002b48f26dc7d1 in orte_oob_base_get_addr (uri=0x2b48f6300020)
>>> at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:234
>>> #6 0x00002b48f53e7d4a in orte_rml_oob_get_uri () at
>>> ../../../../../openmpi-1.8.1/orte/mca/rml/oob/rml_oob_contact.c:36
>>> #7 0x00002b48f26fa181 in orte_routed_base_register_sync (setup=32 ' ')
>>> at ../../../../openmpi-1.8.1/orte/mca/routed/base/routed_base_fns.c:301
>>> #8 0x00002b48f4bbcccf in init_routes (job=4130340896,
>>> ndat=0x2b48f63000b8) at
>>> ../../../../../openmpi-1.8.1/orte/mca/routed/binomial/routed_binomial.c:705
>>> #9 0x00002b48f26c615d in orte_ess_base_app_setup
>>> (db_restrict_local=32 ' ') at
>>> ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:245
>>> #10 0x00002b48f45b069f in rte_init () at
>>> ../../../../../openmpi-1.8.1/orte/mca/ess/env/ess_env_module.c:146
>>> #11 0x00002b48f26935ab in orte_init (pargc=0x2b48f6300020,
>>> pargv=0x2b48f63000b8, flags=8) at
>>> ../../openmpi-1.8.1/orte/runtime/orte_init.c:148
>>> #12 0x00002b48f1739d38 in ompi_mpi_init (argc=1, argv=0x7fffebf0d1f8,
>>> requested=8, provided=0x0) at
>>> ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:464
>>> #13 0x00002b48f1760a37 in PMPI_Init (argc=0x2b48f6300020,
>>> argv=0x2b48f63000b8) at pinit.c:84
>>> #14 0x00000000004024ef in main (argc=1, argv=0x7fffebf0d1f8) at
>>> ring_c.c:19
>>>
>>> *From:* users [mailto:users-bounces_at_[hidden]] *On Behalf Of* Ralph Castain
>>> *Sent:* Wednesday, June 04, 2014 11:00 AM
>>> *To:* Open MPI Users
>>> *Subject:* Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>>>
>>> Does the trace go any further back? Your prior trace seemed to
>>> indicate an error in our OOB framework, but in a very basic place.
>>> Looks like it could be an uninitialized variable, and having the line
>>> number down as deep as possible might help identify the source
>>> On Jun 4, 2014, at 7:55 AM, Fischer, Greg A.
>>> <fischega_at_[hidden] <mailto:fischega_at_[hidden]>> wrote:
>>>
>>>
>>> Oops, ulimit was set improperly. I generated a core file, loaded it in
>>> GDB, and ran a backtrace:
>>> Core was generated by `ring_c'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020,
>>> bytes=47890224382136) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>> 4098 bck->fd = unsorted_chunks(av);
>>> (gdb) bt
>>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020,
>>> bytes=47890224382136) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>> #1 0x0000000000000000 in ?? ()
>>> Is that helpful?
>>> Greg
>>>
>>> *From:* Fischer, Greg A.
>>> *Sent:* Wednesday, June 04, 2014 10:17 AM
>>> *To:* 'Open MPI Users'
>>> *Cc:* Fischer, Greg A.
>>> *Subject:* RE: [OMPI users] intermittent segfaults with openib on ring_c.c
>>>
>>> I recompiled with “--enable-debug” but it doesn’t seem to be providing
>>> any more information or a core dump. I’m compiling ring_c.c with:
>>> mpicc ring_c.c -g -traceback -o ring_c
>>> and running with:
>>> mpirun -np 4 --mca btl openib,self ring_c
>>> and I’m getting:
>>> [binf112:05845] *** Process received signal ***
>>> [binf112:05845] Signal: Segmentation fault (11)
>>> [binf112:05845] Signal code: Address not mapped (1)
>>> [binf112:05845] Failing at address: 0x10
>>> [binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
>>> [binf112:05845] [ 1]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
>>> [binf112:05845] [ 2]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
>>> [binf112:05845] [ 3]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
>>> [binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
>>> [binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
>>> [binf112:05845] [ 6]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
>>> [binf112:05845] [ 7]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
>>> [binf112:05845] [ 8]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
>>> [binf112:05845] [ 9]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
>>> [binf112:05845] [10]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
>>> [binf112:05845] [11]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
>>> [binf112:05845] [12]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
>>> [binf112:05845] [13]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
>>> [binf112:05845] [14]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
>>> [binf112:05845] [15]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
>>> [binf112:05845] [16] ring_c[0x4024ef]
>>> [binf112:05845] [17]
>>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
>>> [binf112:05845] [18] ring_c[0x4023f9]
>>> [binf112:05845] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 3 with PID 5845 on node xxxx112
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> Does any of that help?
>>> Greg
>>>
>>> *From:* users [mailto:users-bounces_at_[hidden]] *On Behalf Of* Ralph Castain
>>> *Sent:* Tuesday, June 03, 2014 11:54 PM
>>> *To:* Open MPI Users
>>> *Subject:* Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>>>
>>> Sounds odd - can you configure OMPI --enable-debug and run it again?
>>> If it fails and you can get a core dump, could you tell us the line
>>> number where it is failing?
>>> On Jun 3, 2014, at 9:58 AM, Fischer, Greg A.
>>> <fischega_at_[hidden] <mailto:fischega_at_[hidden]>> wrote:
>>>
>>> Apologies – I forgot to add some of the information requested by the FAQ:
>>>
>>> 1. OpenFabrics is provided by the Linux distribution:
>>>
>>> [binf102:fischega] $ rpm -qa | grep ofed
>>> ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
>>> ofed-1.5.4.1-0.11.5
>>> ofed-doc-1.5.4.1-0.11.5
>>>
>>>
>>> 2. Linux Distro / Kernel:
>>>
>>> [binf102:fischega] $ cat /etc/SuSE-release
>>> SUSE Linux Enterprise Server 11 (x86_64)
>>> VERSION = 11
>>> PATCHLEVEL = 3
>>>
>>> [binf102:fischega] $ uname -a
>>> Linux xxxx102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013
>>> (ccab990) x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> 3. Not sure which subnet manager is being used – I think OpenSM, but
>>> I’ll need to check with my administrators.
>>>
>>>
>>> 4. Output of ibv_devinfo is attached.
>>>
>>>
>>> 5. ifconfig output is attached.
>>>
>>>
>>> 6. ulimit -l output:
>>>
>>> [binf102:fischega] $ ulimit -l
>>> unlimited
>>>
>>> Greg
>>>
>>>
>>> *From:* Fischer, Greg A.
>>> *Sent:* Tuesday, June 03, 2014 12:38 PM
>>> *To:* Open MPI Users
>>> *Cc:* Fischer, Greg A.
>>> *Subject:* intermittent segfaults with openib on ring_c.c
>>>
>>> Hello openmpi-users,
>>> I’m running into a perplexing problem on a new system, whereby I’m
>>> experiencing intermittent segmentation faults when I run the ring_c.c
>>> example and use the openib BTL. See an example below. Approximately
>>> 50% of the time it provides the expected output, but the other 50% of
>>> the time, it segfaults. LD_LIBRARY_PATH is set correctly, and the
>>> version of “mpirun” being invoked is correct. The output of ompi_info
>>> --all is attached.
>>> One potential problem may be that the system that OpenMPI was compiled
>>> on is /mostly/ the same as the system where it is being executed, but
>>> there are some differences in the installed packages. I’ve checked the
>>> critical ones (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they
>>> appear to be the same.
>>> Can anyone suggest how I might start tracking this problem down?
>>> Thanks,
>>> Greg
>>> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
>>> [binf102:31268] *** Process received signal ***
>>> [binf102:31268] Signal: Segmentation fault (11)
>>> [binf102:31268] Signal code: Address not mapped (1)
>>> [binf102:31268] Failing at address: 0x10
>>> [binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
>>> [binf102:31268] [ 1]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
>>> [0x2b42203fd7e3]
>>> [binf102:31268] [ 2]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b)
>>> [0x2b4220400d3b]
>>> [binf102:31268] [ 3]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f)
>>> [0x2b42204008ef]
>>> [binf102:31268] [ 4]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876)
>>> [0x2b4220400876]
>>> [binf102:31268] [ 5]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c)
>>> [0x2b422572334c]
>>> [binf102:31268] [ 6]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa)
>>> [0x2b422041d64a]
>>> [binf102:31268] [ 7]
>>> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f)
>>> [0x2b422573612f]
>>> [binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
>>> [binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
>>> [binf102:31268] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 31268 on node xxxx102
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> <ibv_devinfo.txt><ifconfig.txt>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users