Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] segv in ompi_info
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2014-07-09 17:30:20


I agree with Gilles that there is not a "bug", but I believe that OMPI
could do something better.

First, I'll show that
a) this is not a new behavior
b) it is not limited to "less".

$ (strace ompi_info -a | grep -m1 btl) 2>&1 | grep -e 'Open MPI:' -e SIGPIPE
write(1, " Open MPI: 1.4.5\n", 32) = 32
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++

a) the ompi_info output says "Open MPI: 1.4.5" (thus not new by any
stretch).
b) the "-m1" argument to the inner "grep" tells it to exit after the first
match, so the reader goes away while ompi_info is still writing.

The "strace" is to detect/report that SIGPIPE was received.
The outer grep picks out the relevant info from the flood of strace output.

So, the "issue" today seems to be that mxm is catching the signal and
producing a backtrace. This backtrace is NOT a desirable behavior. This
is not intrinsically the "fault" of mxm, because there is no reason to
believe that ompi_info would never link to (or dlopen) another library that
performs backtraces.

So, I would suggest that ompi_info simply call "signal(SIGPIPE, SIG_IGN);"
to resolve this in a way not specific to mxm.

-Paul

On Wed, Jul 9, 2014 at 3:47 AM, Gilles Gouaillardet <
gilles.gouaillardet_at_[hidden]> wrote:

> Mike,
>
> how do you test ?
> i cannot reproduce a bug :
>
> if you run ompi_info -a -l 9 | less
>
> and i press 'q' at the early stage (e.g. before all output is written to
> the pipe)
> then the less process exits, and ompi_info receives SIGPIPE and crashes
> (which is normal unix behaviour)
>
> now if i press the spacebar until the end of the output (e.g. i get the
> (END) message from less)
> and then press 'q', then there is no problem.
>
> strace -e signal ompi_info -a -l 9 | true
> will cause ompi_info to receive a SIGPIPE
>
> strace -e signal dd if=/dev/zero bs=1M count=1 | true
> will cause dd to receive a SIGPIPE
>
> unless i miss something, i would conclude there is no bug
>
> Cheers,
>
> Gilles
>
> On 2014/07/09 19:33, Mike Dubman wrote:
>
> mxm only intercepts signals and prints the stacktrace.
> happens on trunk as well.
> only when "| less" is used.
>
> On Tue, Jul 8, 2014 at 4:50 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]>
> wrote:
>
>
> I'm unable to replicate. Please provide more detail...? Is this a
> problem in the MXM component?
>
> On Jul 8, 2014, at 9:20 AM, Mike Dubman <miked_at_[hidden]> wrote:
>
>
>
> $/usr/mpi/gcc/openmpi-1.8.2a1/bin/ompi_info -a -l 9|less
> Caught signal 13 (Broken pipe)
> ==== backtrace ====
> 2 0x0000000000054cac mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:653
> 3 0x0000000000054e74 mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:628
> 4 0x00000033fbe32920 killpg() ??:0
> 5 0x00000033fbedb650 __write_nocancel() interp.c:0
> 6 0x00000033fbe71d53 _IO_file_write@@GLIBC_2.2.5() ??:0
> 7 0x00000033fbe73305 _IO_do_write@@GLIBC_2.2.5() ??:0
> 8 0x00000033fbe719cd _IO_file_xsputn@@GLIBC_2.2.5() ??:0
> 9 0x00000033fbe48410 _IO_vfprintf() ??:0
> 10 0x00000033fbe4f40a printf() ??:0
> 11 0x000000000002bc84 opal_info_out() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:853
> 12 0x000000000002c6bb opal_info_show_mca_group_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:658
> 13 0x000000000002c882 opal_info_show_mca_group_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:716
> 14 0x000000000002cc13 opal_info_show_mca_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:742
> 15 0x000000000002d074 opal_info_do_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:485
> 16 0x000000000040167b main() ??:0
> 17 0x00000033fbe1ecdd __libc_start_main() ??:0
> 18 0x0000000000401349 _start() ??:0
> ===================
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/07/15075.php
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15076.php
>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15080.php
>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/07/15082.php
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900