Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-09-03 15:21:35


Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Dang - I just finished running it on odin without a problem. Are you seeing this with a debug or optimized build?
>
>
> On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>
>> Yes, it fails on the current trunk (r29112). That is what started me on the journey to figure out when things went wrong. It was working up until r29058.
>>
>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Ralph Castain
>> Sent: Tuesday, September 03, 2013 2:49 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>>
>> Are you all the way up to the current trunk? There have been a few typo fixes since the original commit.
>>
>> I'm not familiar with the OOB connect code in openib. The OOB itself isn't using free lists, so I suspect it is something up in the OOB connect code itself. I'll take a look and see if something leaps out at me - it seems to be working fine on IU's odin cluster, which is the only IB-based system I can access.
>>
>>
>> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>>
>>
>> As mentioned in the weekly conference call, I am seeing some strange errors when using the openib BTL. I have narrowed down the changeset that broke things to the ORTE async code.
>>
>> https://svn.open-mpi.org/trac/ompi/changeset/29058 (and https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix compile errors)
>>
>> Changeset 29057 does not have these issues. I do not have a very good characterization of the failures. The failures are not consistent. Sometimes they can pass. Sometimes the stack trace can be different. They seem to happen more with larger np, like np=4 and more.
>>
>> The first failure mode is a segmentation violation, and it always seems to be that we are trying to pop something off a free list. But the upper parts of the stack trace can vary. This is with the trunk version 29061.
>> Ralph, any thoughts on where we go from here?
>>
>> [rolf_at_Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
>> MPITEST info (0): Starting: MPI_Irecv_comm:
>> [compute-0-4:04752] *** Process received signal ***
>> [compute-0-4:04752] Signal: Segmentation fault (11)
>> [compute-0-4:04752] Signal code: Address not mapped (1)
>> [compute-0-4:04752] Failing at address: 0x28
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> [rolf_at_Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
>> GNU gdb Fedora (6.8-27.el5)
>> Copyright (C) 2008 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> Core was generated by `MPI_Irecv_comm_c'.
>> Program terminated with signal 11, Segmentation fault.
>> [New process 4753]
>> [New process 4756]
>> [New process 4755]
>> [New process 4754]
>> [New process 4752]
>> #0 0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
>> 111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
>> (gdb) where
>> #0 0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
>> #1 0x00002aaaad6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
>> #2 0x00002aaaad6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
>> #3 0x00002aaaad6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)
>> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
>> #4 0x00002aaaad6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)
>> at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
>> #5 0x00002aaaad6fe71c in qp_create_all (endpoint=0x59f3120) at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
>> #6 0x00002aaaad6fde2b in reply_start_connect (endpoint=0x59f3120, rem_info=0x40ea8ed0)
>> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
>> #7 0x00002aaaad7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, buffer=0x40ea8f80, tag=102, cbdata=0x0)
>> at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
>> #8 0x00002ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x5b0bac0)
>> at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
>> #9 0x00002ae8027164a1 in event_process_active_single_queue (base=0x58ac620, activeq=0x58aa5b0)
>> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
>> #10 0x00002ae802716b24 in event_process_active (base=0x58ac620) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
>> #11 0x00002ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, flags=1)
>> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
>> #12 0x00002ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at ../../orte/runtime/orte_init.c:180
>> #13 0x0000003ab1e06367 in start_thread () from /lib64/libpthread.so.0
>> #14 0x0000003ab16d2f7d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>