Open MPI's signal handler (the show_stackframe function defined in
opal/util/stacktrace.c) calls non-async-signal-safe functions,
which can deadlock the process.
See the attached mpisigabrt.c. Passing corrupted memory to realloc(3)
raises SIGABRT, which invokes the show_stackframe function.
On some systems show_stackframe then deadlocks in backtrace_symbols(3):
backtrace_symbols(3) calls malloc(3) internally, and malloc waits for
the mutex that the interrupted realloc call still holds.
The attached mpisigabrt.gstack.txt shows the stack trace obtained
with gdb in this deadlock on Ubuntu 12.04 LTS (precise) x86_64.
I could not reproduce this behavior on RHEL 5/6, but I can
reproduce it on the K computer and its successor, PRIMEHPC FX10.
Passing non-heap memory to free(3) and double-free also cause
this deadlock.
malloc (and backtrace_symbols) is not marked as async-signal-safe
in POSIX or in current glibc, though it seems to have been marked
so in old glibc. So we should not call it in the signal handler.
I wrote a patch to address this issue. See the attached
This patch calls backtrace_symbols_fd(3) instead of backtrace_symbols(3).
Though backtrace_symbols_fd is not declared async-signal-safe,
its man page states that it does not call malloc internally, so it
should be safer.
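The difference between the two calls can be sketched as follows. This
is only an illustration assuming glibc's <execinfo.h>, not the patch
itself; print_backtrace_fd and MAX_FRAMES are hypothetical names.

```c
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 32   /* hypothetical limit, chosen for illustration */

/* Write the current call stack to fd and return the number of frames.
 * backtrace_symbols() would return a malloc'ed array of strings;
 * backtrace_symbols_fd() writes the same lines directly to fd. */
static int print_backtrace_fd(int fd)
{
    void *frames[MAX_FRAMES];   /* lives on the stack, no heap use */
    int n = backtrace(frames, MAX_FRAMES);

    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

A handler would call print_backtrace_fd(STDERR_FILENO), or pass the
descriptor of whatever file the trace should go to. Since the symbol
strings are never returned to the caller, any per-line prefix has to be
written to the descriptor separately.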
The output format of the show_stackframe function is not changed by
this patch. However, the interface of the opal_backtrace_print function
(backtrace framework) is changed to keep the output format compatible.
This requires changes in some additional files (ompi_mpi_abort.c
This patch also removes unnecessary fflush(3) calls, which are
meaningless for the write(2) system call but might cause a similar
What do you think about this patch?
MPI development team,