Open MPI Development Mailing List Archives

Subject: [OMPI devel] openib BTL problems with ORTE async changes
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2013-09-03 14:34:09

As mentioned in the weekly conference call, I am seeing some strange errors when using the openib BTL. I have narrowed down the changeset that broke things to the ORTE async code (and the changeset that was needed to fix compile errors).

Changeset 29057 does not have these issues. I do not have a very good characterization of the failures, because they are not consistent: sometimes the test passes, and sometimes the stack trace is different. They seem to happen more often with larger np, such as np=4 and above.

The first failure mode is a segmentation violation, and it always seems to occur while we are trying to pop something off a free list, although the upper parts of the stack trace can vary. This is with trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf_at_Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
MPITEST info (0): Starting: MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal ***
[compute-0-4:04752] Signal: Segmentation fault (11)
[compute-0-4:04752] Signal code: Address not mapped (1)
[compute-0-4:04752] Failing at address: 0x28
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).
[rolf_at_Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
Core was generated by `MPI_Irecv_comm_c'.
Program terminated with signal 11, Segmentation fault.
[New process 4753]
[New process 4756]
[New process 4755]
[New process 4754]
[New process 4752]
#0 0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
111 lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
(gdb) where
#0 0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
#1 0x00002aaaad6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
#2 0x00002aaaad6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
#3 0x00002aaaad6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)
    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
#4 0x00002aaaad6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)
    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
#5 0x00002aaaad6fe71c in qp_create_all (endpoint=0x59f3120) at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
#6 0x00002aaaad6fde2b in reply_start_connect (endpoint=0x59f3120, rem_info=0x40ea8ed0)
    at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
#7 0x00002aaaad7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, buffer=0x40ea8f80, tag=102, cbdata=0x0)
    at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
#8 0x00002ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x5b0bac0)
    at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
#9 0x00002ae8027164a1 in event_process_active_single_queue (base=0x58ac620, activeq=0x58aa5b0)
    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#10 0x00002ae802716b24 in event_process_active (base=0x58ac620) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#11 0x00002ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, flags=1)
    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#12 0x00002ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at ../../orte/runtime/orte_init.c:180
#13 0x0000003ab1e06367 in start_thread () from /lib64/
#14 0x0000003ab16d2f7d in clone () from /lib64/
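
For readers following the trace, here is a minimal sketch of the kind of pop that frame #0 performs, and of why a small failing address such as 0x28 is often what you see when a field is read through a NULL or near-NULL item pointer. This is an illustration under stated assumptions, not a diagnosis: the `demo_*` names and the struct layout are hypothetical, they are not the real `opal_list_item_t` or `opal_atomic_lifo_pop`, and the sketch is single-threaded, so it says nothing about how the item pointer went bad in the actual failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list item; NOT the real opal_list_item_t layout. */
struct demo_item {
    uint64_t payload;
    struct demo_item *next;
};

/* Hypothetical LIFO holding only a head pointer. */
struct demo_lifo {
    struct demo_item *head;
};

/* Address that reading item->next would touch, computed without
 * dereferencing item. For item == NULL this is simply the offset of
 * the field, which is why such faults report small addresses. */
static uintptr_t demo_fault_address(const struct demo_item *item)
{
    return (uintptr_t)item + offsetof(struct demo_item, next);
}

static void demo_lifo_push(struct demo_lifo *lifo, struct demo_item *item)
{
    assert(item != NULL);
    item->next = lifo->head;
    lifo->head = item;
}

/* Single-threaded pop. The NULL check is the important part: without
 * it, the assignment below reads item->next through a NULL pointer
 * when the LIFO is empty. */
static struct demo_item *demo_lifo_pop(struct demo_lifo *lifo)
{
    struct demo_item *item = lifo->head;
    if (item == NULL) {
        return NULL;
    }
    lifo->head = item->next;
    item->next = NULL;
    return item;
}
```

In gdb, comparing the failing address against the offset of the field read at the faulting line (for example with `print &((opal_list_item_t *)0)->opal_list_next`) is one way to check whether the item pointer was NULL at that point.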
