Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Intermittent mpirun crash?
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2014-01-30 14:53:28


I ran mpirun through valgrind and got some strange complaints about thread 2. Hunting around the mpirun code, I see that we start a listener thread but never have it finish during shutdown. Therefore, I added this snippet of code (probably in the wrong place) and I no longer see my intermittent crashes.

Ralph, what do you think? Does this seem reasonable?

Rolf

[rvandevaart_at_drossetti-ivy0 ompi-v1.7]$ svn diff
Index: orte/mca/oob/tcp/oob_tcp_component.c
===================================================================
--- orte/mca/oob/tcp/oob_tcp_component.c (revision 30500)
+++ orte/mca/oob/tcp/oob_tcp_component.c (working copy)
@@ -631,6 +631,10 @@
     opal_output_verbose(2, orte_oob_base_framework.framework_output,
                         "%s TCP SHUTDOWN",
                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
+    if (ORTE_PROC_IS_HNP) {
+        mca_oob_tcp_component.listen_thread_active = 0;
+        opal_thread_join(&mca_oob_tcp_component.listen_thread, NULL);
+    }
 
     while (NULL != (item = opal_list_remove_first(&mca_oob_tcp_component.listeners))) {
         OBJ_RELEASE(item);
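
For reference, here is a minimal standalone sketch of the pattern in the patch above (a shutdown flag checked by the listener loop, followed by a join so the thread has fully exited before its data structures are torn down). It uses plain pthreads instead of the actual OPAL thread API, and every name in it is an illustrative stand-in rather than the real oob/tcp code:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-ins for listen_thread_active / listen_thread */
static volatile int listen_thread_active = 1;
static pthread_t listen_thread;

static void *listen_loop(void *arg)
{
    (void)arg;
    /* Poll for new connections until the main thread asks us to stop */
    while (listen_thread_active) {
        /* ... accept() and hand off connections here ... */
        usleep(1000);
    }
    return NULL;
}

int main(void)
{
    pthread_create(&listen_thread, NULL, listen_loop, NULL);

    /* ... normal operation ... */
    sleep(1);

    /* Shutdown: clear the flag, then join so the thread is really gone
     * before the structures it touches are torn down. */
    listen_thread_active = 0;
    pthread_join(listen_thread, NULL);

    printf("listener thread joined cleanly\n");
    return 0;
}

Without the join, the listener thread can still be running while shutdown tears things down underneath it, which would be consistent with segfaults that only show up intermittently at the end of a run.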

>-----Original Message-----
>From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Ralph
>Castain
>Sent: Thursday, January 30, 2014 12:35 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] Intermittent mpirun crash?
>
>That option might explain why your test process is failing (it segfaulted as
>well), but it obviously wouldn't have anything to do with mpirun.
>
>On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart <rvandevaart_at_[hidden]>
>wrote:
>
>> I just retested with --mca mpi_leave_pinned 0 and that made no difference.
>I still see the mpirun crash.
>>
>>> -----Original Message-----
>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George
>>> Bosilca
>>> Sent: Thursday, January 30, 2014 11:59 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] Intermittent mpirun crash?
>>>
>>> I got something similar two days ago, with a large software package
>>> that makes heavy use of MPI_Waitany/MPI_Waitsome (and that was working
>>> seamlessly a month ago). I had to find a quick fix; once I figured out
>>> that turning leave_pinned off fixes the problem, I did not investigate
>>> any further.
>>>
>>> Do you see a similar behavior?
>>>
>>> George.
>>>
>>> On Jan 30, 2014, at 17:26 , Rolf vandeVaart <rvandevaart_at_[hidden]>
>wrote:
>>>
>>>> I am seeing this happening to me very intermittently. Looks like
>>>> mpirun is
>>> getting a SEGV. Is anyone else seeing this?
>>>> This is 1.7.4 built yesterday. (Note that I added some stuff to
>>>> what is being printed out so the message is slightly different than
>>>> 1.7.4
>>>> output)
>>>>
>>>> mpirun -np 6 -host
>>>> drossetti-ivy0,drossetti-ivy1,drossetti-ivy2,drossetti-ivy3 --mca
>>>> btl_openib_warn_default_gid_prefix 0 -- `pwd`/src/MPI_Waitsome_p_c
>>>> MPITEST info (0): Starting: MPI_Waitsome_p: Persistent Waitsome
>>>> using two nodes
>>>> MPITEST_results: MPI_Waitsome_p: Persistent Waitsome using two
>>>> nodes all tests PASSED (742) [drossetti-ivy0:10353] *** Process
>>>> (mpirun)received signal *** [drossetti-ivy0:10353] Signal:
>>>> Segmentation fault (11) [drossetti-ivy0:10353] Signal code: Address
>>>> not mapped (1) [drossetti-ivy0:10353] Failing at address:
>>>> 0x7fd31e5f208d [drossetti-ivy0:10353] End of signal information -
>>>> not sleeping
>>>> gmake[1]: *** [MPI_Waitsome_p_c] Segmentation fault (core dumped)
>>>> gmake[1]: Leaving directory `/geppetto/home/rvandevaart/public/ompi-
>>> tests/trunk/intel_tests'
>>>>
>>>> (gdb) where
>>>> #0 0x00007fd31f620807 in ?? () from /lib64/libgcc_s.so.1
>>>> #1 0x00007fd31f6210b9 in _Unwind_Backtrace () from
>>>> /lib64/libgcc_s.so.1
>>>> #2 0x00007fd31fb2893e in backtrace () from /lib64/libc.so.6
>>>> #3 0x00007fd320b0d622 in opal_backtrace_buffer
>>> (message_out=0x7fd31e5e33a0, len_out=0x7fd31e5e33ac)
>>>> at
>>>> ../../../../../opal/mca/backtrace/execinfo/backtrace_execinfo.c:57
>>>> #4 0x00007fd320b0a794 in show_stackframe (signo=11,
>>>> info=0x7fd31e5e3930, p=0x7fd31e5e3800) at
>>>> ../../../opal/util/stacktrace.c:354
>>>> #5 <signal handler called>
>>>> #6 0x00007fd31e5f208d in ?? ()
>>>> #7 0x00007fd31e5e46d8 in ?? ()
>>>> #8 0x000000000000c2a8 in ?? ()
>>>> #9 0x0000000000000000 in ?? ()
>>>>
>>>>
>>>
>
>_______________________________________________
>devel mailing list
>devel_at_[hidden]
>http://www.open-mpi.org/mailman/listinfo.cgi/devel