Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] segv in ompi_info
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-07-09 18:42:04


Good suggestion - fixed on trunk in r32189
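
For reference, a minimal C sketch of the approach Paul suggests in the quoted message below: ignore SIGPIPE so that a write to a closed pipe fails with EPIPE instead of raising a signal. The helper names ignore_sigpipe and write_to_closed_pipe are hypothetical, chosen only for illustration, and this is not necessarily how r32189 implements the change.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: ignore SIGPIPE for the whole process.
 * After this, writing to a pipe with no reader fails with EPIPE
 * instead of delivering a signal. Returns 0 on success, -1 on error. */
int ignore_sigpipe(void)
{
    return (signal(SIGPIPE, SIG_IGN) == SIG_ERR) ? -1 : 0;
}

/* Hypothetical helper: write one byte to a pipe whose read end is
 * already closed. Returns the errno set by write(), 0 if the write
 * unexpectedly succeeded, or -1 if the pipe could not be created. */
int write_to_closed_pipe(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    close(fds[0]);                     /* no reader is left */
    ssize_t n = write(fds[1], "x", 1); /* would raise SIGPIPE by default */
    int err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;
}
```

Calling ignore_sigpipe() early in main() and then write_to_closed_pipe() should yield EPIPE rather than killing the process. Two caveats: signal dispositions are process-wide, so a library that installs its own SIGPIPE handler afterwards replaces SIG_IGN; and with SIGPIPE ignored, the program keeps running after the reader exits unless it checks for write errors.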

On Jul 9, 2014, at 2:30 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> I agree with Gilles that there is not a "bug", but I believe that OMPI could do something better.
>
> First, I'll show that
> a) this is not a new behavior
> b) it is not limited to "less".
>
> $ (strace ompi_info -a | grep -m1 btl) 2>&1 | grep -e 'Open MPI:' -e SIGPIPE
> write(1, " Open MPI: 1.4.5\n", 32) = 32
> --- SIGPIPE (Broken pipe) @ 0 (0) ---
> +++ killed by SIGPIPE +++
>
> a) the ompi_info output says "Open MPI: 1.4.5" (thus not new by any stretch).
> b) the "-m1" argument to the inner "grep" tells it to exit after the first match, so the reader here is "grep", not "less".
>
> The "strace" is to detect/report that SIGPIPE was received.
> The outer grep picks out the relevant info from the flood of strace output.
>
> So, the "issue" today seems to be that mxm is catching the signal and producing a backtrace. This backtrace is NOT a desirable behavior. This is not intrinsically the "fault" of mxm, because there is no reason to believe that ompi_info would never link to (or dlopen) another library that performs backtraces.
>
> So, I would suggest that ompi_info simply call "signal(SIGPIPE, SIG_IGN);" to resolve this in a way not specific to mxm.
>
> -Paul
>
>
> On Wed, Jul 9, 2014 at 3:47 AM, Gilles Gouaillardet <gilles.gouaillardet_at_[hidden]> wrote:
> Mike,
>
> how do you test ?
> i cannot reproduce a bug :
>
> if you run ompi_info -a -l 9 | less
>
> and i press 'q' at the early stage (i.e. before all output is written to the pipe)
> then the less process exits, and ompi_info receives SIGPIPE and crashes (which is normal unix behaviour)
>
> now if i press the spacebar until the end of the output (i.e. i get the (END) message from less)
> and then press 'q', then there is no problem.
>
> strace -e signal ompi_info -a -l 9 | true
> will cause ompi_info to receive SIGPIPE
>
> strace -e signal dd if=/dev/zero bs=1M count=1 | true
> will cause dd to receive SIGPIPE
>
> unless i am missing something, i would conclude there is no bug
>
> Cheers,
>
> Gilles
>
> On 2014/07/09 19:33, Mike Dubman wrote:
>> mxm only intercepts signals and prints the stack trace.
>> happens on trunk as well.
>> only when "| less" is used.
>>
>> On Tue, Jul 8, 2014 at 4:50 PM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>>
>>> I'm unable to replicate. Please provide more detail...? Is this a
>>> problem in the MXM component?
>>>
>>> On Jul 8, 2014, at 9:20 AM, Mike Dubman <miked_at_[hidden]> wrote:
>>>
>>>> $/usr/mpi/gcc/openmpi-1.8.2a1/bin/ompi_info -a -l 9|less
>>>> Caught signal 13 (Broken pipe)
>>>> ==== backtrace ====
>>>> 2 0x0000000000054cac mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:653
>>>> 3 0x0000000000054e74 mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:628
>>>> 4 0x00000033fbe32920 killpg() ??:0
>>>> 5 0x00000033fbedb650 __write_nocancel() interp.c:0
>>>> 6 0x00000033fbe71d53 _IO_file_write@@GLIBC_2.2.5() ??:0
>>>> 7 0x00000033fbe73305 _IO_do_write@@GLIBC_2.2.5() ??:0
>>>> 8 0x00000033fbe719cd _IO_file_xsputn@@GLIBC_2.2.5() ??:0
>>>> 9 0x00000033fbe48410 _IO_vfprintf() ??:0
>>>> 10 0x00000033fbe4f40a printf() ??:0
>>>> 11 0x000000000002bc84 opal_info_out() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:853
>>>> 12 0x000000000002c6bb opal_info_show_mca_group_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:658
>>>> 13 0x000000000002c882 opal_info_show_mca_group_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:716
>>>> 14 0x000000000002cc13 opal_info_show_mca_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:742
>>>> 15 0x000000000002d074 opal_info_do_params() /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:485
>>>> 16 0x000000000040167b main() ??:0
>>>> 17 0x00000033fbe1ecdd __libc_start_main() ??:0
>>>> 18 0x0000000000401349 _start() ??:0
>>>> ===================
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15075.php
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15076.php
>>>
>>
>>
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15080.php
>
>
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15082.php
>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15085.php