Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-09 17:57:36


Not sure I grok - are you saying you believe the assert is bogus? We haven't seen it elsewhere, but perhaps this happens only when configured for and running with c/r?

I'm happy to take a look if you can provide more specifics on how to reproduce it.

On Jan 9, 2014, at 2:46 PM, Adrian Reber <adrian_at_[hidden]> wrote:

> For my CR work this can probably be ignored. I think I was looking in the
> wrong place.
>
> On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote:
>> Continuing with the CR code I now get a crash which can be easily reproduced
>> using orte/test/system/orte_barrier.c
>>
>> I get:
>>
>> orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed.
>> [dcbz:05085] *** Process received signal ***
>> [dcbz:05085] Signal: Aborted (6)
>> [dcbz:05085] Signal code: (-6)
>> [dcbz:05085] [ 0] /lib64/libpthread.so.0(+0xf750)[0x7f95bca0b750]
>> [dcbz:05085] [ 1] /lib64/libc.so.6(gsignal+0x39)[0x7f95bc672c59]
>> [dcbz:05085] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95bc674368]
>> [dcbz:05085] [ 3] /lib64/libc.so.6(+0x2ebb6)[0x7f95bc66bbb6]
>> [dcbz:05085] [ 4] /lib64/libc.so.6(+0x2ec62)[0x7f95bc66bc62]
>> [dcbz:05085] [ 5] /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86975)[0x7f95bcfbd975]
>> [dcbz:05085] [ 6] /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x86d9a)[0x7f95bcfbdd9a]
>> [dcbz:05085] [ 7] /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8c831)[0x7f95bcca5831]
>> [dcbz:05085] [ 8] /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(+0x8caa3)[0x7f95bcca5aa3]
>> [dcbz:05085] [ 9] /home/adrian/devel/openmpi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x2c1)[0x7f95bcca611f]
>> [dcbz:05085] [10] /home/adrian/devel/openmpi-trunk/lib/libopen-rte.so.0(+0x2233b)[0x7f95bcf5933b]
>> [dcbz:05085] [11] /lib64/libpthread.so.0(+0x7f33)[0x7f95bca03f33]
>> [dcbz:05085] [12] /lib64/libc.so.6(clone+0x6d)[0x7f95bc731ead]
>> [dcbz:05085] *** End of error message ***
>> --------------------------------------------------------------------------
>> orterun noticed that process rank 0 with PID 5085 on node dcbz exited on signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>> and in gdb
>>
>> [New LWP 5086]
>> [New LWP 5085]
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib64/libthread_db.so.1".
>> Core was generated by `system/orte_barrier'.
>> Program terminated with signal SIGABRT, Aborted.
>> #0 0x00007f95bc672c59 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> 56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
>> (gdb) bt
>> #0 0x00007f95bc672c59 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1 0x00007f95bc6744a8 in __GI_abort () at abort.c:118
>> #2 0x00007f95bc66bbb6 in __assert_fail_base (fmt=0x7f95bc7b8ea8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
>> assertion=assertion@entry=0x7f95bd06d6c0 "0 == item->opal_list_item_refcount",
>> file=file@entry=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", line=line@entry=547,
>> function=function@entry=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append") at assert.c:92
>> #3 0x00007f95bc66bc62 in __GI___assert_fail (assertion=0x7f95bd06d6c0 "0 == item->opal_list_item_refcount",
>> file=0x7f95bd06d600 "../../../../../opal/class/opal_list.h", line=547,
>> function=0x7f95bd06d9d0 <__PRETTY_FUNCTION__.4605> "_opal_list_append") at assert.c:101
>> #4 0x00007f95bcfbd975 in _opal_list_append (list=0x7f95bd2b9408 <orte_grpcomm_base+8>, item=0x1f35be0,
>> FILE_NAME=0x7f95bd06d718 "../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c", LINENO=163)
>> at ../../../../../opal/class/opal_list.h:547
>> #5 0x00007f95bcfbdd9a in process_barrier (fd=-1, args=4, cbdata=0x1f35ed0) at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
>> #6 0x00007f95bcca5831 in event_process_active_single_queue (base=0x1ef63a0, activeq=0x1ef6360)
>> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
>> #7 0x00007f95bcca5aa3 in event_process_active (base=0x1ef63a0) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
>> #8 0x00007f95bcca611f in opal_libevent2021_event_base_loop (base=0x1ef63a0, flags=1)
>> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
>> #9 0x00007f95bcf5933b in orte_progress_thread_engine (obj=0x7f95bd2b9160 <orte_progress_thread>) at ../../orte/runtime/orte_init.c:180
>> #10 0x00007f95bca03f33 in start_thread (arg=0x7f95bbb0d700) at pthread_create.c:309
>> #11 0x00007f95bc731ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>> (gdb)
>>
>> As far as I understand, opal_list_append() seems to be called twice at
>> orte/mca/grpcomm/bad/grpcomm_bad_module.c:163
>>
>> opal_list_append(&orte_grpcomm_base.active_colls, &coll->super);
>>
>> I have no idea how to fix this.
>>
>> Adrian
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
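
For readers following the backtrace: the failed assertion `0 == item->opal_list_item_refcount` is a debug check that the item being appended is not already on a list. The following is only a minimal sketch of that kind of check, assuming a simplified doubly linked list. All `demo_*` names are hypothetical stand-ins for illustration and are not the actual OPAL definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified list item (not the real opal_list_item_t). */
typedef struct demo_item {
    struct demo_item *next;
    struct demo_item *prev;
    int refcount;              /* number of lists holding this item: 0 or 1 */
} demo_item_t;

/* Hypothetical list with a sentinel node acting as a circular head. */
typedef struct demo_list {
    demo_item_t sentinel;
    size_t length;
} demo_list_t;

static void demo_list_init(demo_list_t *list)
{
    list->sentinel.next = &list->sentinel;
    list->sentinel.prev = &list->sentinel;
    list->sentinel.refcount = 0;
    list->length = 0;
}

static void demo_item_init(demo_item_t *item)
{
    item->next = NULL;
    item->prev = NULL;
    item->refcount = 0;
}

/* Lets a caller check membership before appending. */
static bool demo_item_on_list(const demo_item_t *item)
{
    return item->refcount != 0;
}

static void demo_list_append(demo_list_t *list, demo_item_t *item)
{
    /* Same idea as the failing check: appending an item that is
     * already on a list would corrupt the links, so abort instead. */
    assert(0 == item->refcount);

    item->prev = list->sentinel.prev;
    item->next = &list->sentinel;
    list->sentinel.prev->next = item;
    list->sentinel.prev = item;

    item->refcount++;
    list->length++;
}

static void demo_list_remove(demo_list_t *list, demo_item_t *item)
{
    /* Only an item that is currently on a list may be removed. */
    assert(1 == item->refcount);

    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->next = NULL;
    item->prev = NULL;

    item->refcount--;
    list->length--;
}
```

In this sketch, calling demo_list_append() twice on the same item without a demo_list_remove() in between aborts at the assert, which matches the symptom in the trace. Whether the real cause here is a double append or an item that was never removed from a previous list is not established in this thread.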