
Subject: Re: [OMPI devel] yesterday commits caused a crash in helloworld with --mca btl tcp, self
From: Hjelm, Nathan T (hjelmn_at_[hidden])
Date: 2014-05-16 16:19:50


I am not seeing this. Maybe it is something exposed by the fact that we now actually call del_procs correctly. I will try to take a look over the weekend.

-Nathan

________________________________________
From: devel [devel-bounces_at_[hidden]] on behalf of Thomas Naughton [naughtont_at_[hidden]]
Sent: Friday, May 16, 2014 11:43 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] yesterday commits caused a crash in helloworld with --mca btl tcp, self

Hi,

I'm also seeing some sporadic failures with recent commits to trunk.
My tests use a slightly different build/configuration and a different
RTE, but the errors are coming from the OMPI ob1 layer.

  works: r31777 (I did not test r31778..r31783)
  fails: r31784M (plus manually applied patch from r31786)

My test was something simple:
     cd examples/
     mpicc -g hello_c.c -o hello_c
     mpirun -np 10 hello_c
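
(For reference, hello_c.c from the examples directory is essentially
the following; this is a minimal equivalent rather than a verbatim
copy. The backtrace below shows the crash happening inside the
MPI_Finalize() call at the end of main().)

     #include <stdio.h>
     #include "mpi.h"

     int main(int argc, char* argv[])
     {
         int rank, size;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
         printf("Hello, world, I am %d of %d\n", rank, size);
         MPI_Finalize();  /* the segfault below happens in this call */
         return 0;
     }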

Again, it is sporadic; I was able to reproduce the failure with
different values of '-np' > 1 (sometimes np=3, other times np=11).

Here's some backtrace / debug info...

Program terminated with signal 11, Segmentation fault.
[New process 7242]
[New process 7255]
#0 0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
     btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
139         if( array->bml_btls[i].btl == btl ) {
(gdb) bt
#0 0xb7a7569f in mca_bml_base_btl_array_remove (array=0x81049ec,
     btl=0xb7a721c0) at ../../../../ompi/mca/bml/bml.h:139
#1 0xb7a7539f in mca_bml_r2_del_proc_btl (proc=0x80debe8, btl=0xb7a721c0)
     at bml_r2.c:551
#2 0xb7a757d8 in mca_bml_r2_finalize () at bml_r2.c:648
#3 0xb70c50b8 in mca_pml_ob1_component_fini () at pml_ob1_component.c:290
#4 0xb7f5a755 in mca_pml_v_component_parasite_finalize ()
     at pml_v_component.c:161
#5 0xb7f58c63 in mca_pml_base_finalize () at base/pml_base_frame.c:120
#6 0xb7ec81e1 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:291
#7 0xb7ef1042 in PMPI_Finalize () at pfinalize.c:46
#8 0x0804874d in main (argc=2, argv=0xbfc8d394) at hello_c.c:24

(gdb) p array->bml_btls
$1 = (mca_bml_base_btl_t *) 0x0
(gdb) p btl
$2 = (struct mca_btl_base_module_t *) 0xb7a721c0
(gdb) p *btl
$3 = {btl_component = 0xb7a72240, btl_eager_limit = 131072,
   btl_rndv_eager_limit = 131072, btl_max_send_size = 262144,
   btl_rdma_pipeline_send_length = 2147483647,
   btl_rdma_pipeline_frag_size = 2147483647,
   btl_min_rdma_pipeline_size = 2147614719, btl_exclusivity = 65536,
   btl_latency = 0, btl_bandwidth = 100, btl_flags = 10, btl_seg_size = 16,
   btl_add_procs = 0xb7a6fd9c <mca_btl_self_add_procs>,
   btl_del_procs = 0xb7a6fdf9 <mca_btl_self_del_procs>, btl_register = 0,
   btl_finalize = 0xb7a6fe03 <mca_btl_self_finalize>,
   btl_alloc = 0xb7a6fe0d <mca_btl_self_alloc>,
   btl_free = 0xb7a70074 <mca_btl_self_free>,
   btl_prepare_src = 0xb7a70329 <mca_btl_self_prepare_src>,
   btl_prepare_dst = 0xb7a70702 <mca_btl_self_prepare_dst>,
   btl_send = 0xb7a70831 <mca_btl_self_send>, btl_sendi = 0,
   btl_put = 0xb7a70910 <mca_btl_self_rdma>,
   btl_get = 0xb7a70910 <mca_btl_self_rdma>,
   btl_dump = 0xb7f35b57 <mca_btl_base_dump>, btl_mpool = 0x0,
   btl_register_error = 0, btl_ft_event = 0xb7a70b00 <mca_btl_self_ft_event>}
(gdb) l
134                                          struct mca_btl_base_module_t* btl )
135 {
136     size_t i = 0;
137     /* find the btl */
138     for( i = 0; i < array->arr_size; i++ ) {
139         if( array->bml_btls[i].btl == btl ) {
140             /* make sure not to go out of bounds */
141             for( ; i < array->arr_size-1; i++ ) {
142                 /* move all btl's back by 1, so the found
143                    btl is "removed" */
(gdb) p array->arr_size
$4 = 69
(gdb) p array->bml_btls
$5 = (mca_bml_base_btl_t *) 0x0
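
The prints above show the inconsistent state directly: arr_size claims
69 entries while bml_btls is NULL, so the first bml_btls[i].btl load at
bml.h:139 is a NULL dereference. Below is a minimal sketch of the kind
of defensive guard that would turn the walk into a no-op; the types are
stand-ins and the guard is an illustration only, not the actual fix
that went into r31786.

     #include <stdbool.h>
     #include <stddef.h>

     /* Stand-in types so the sketch compiles on its own; the real
        definitions live in ompi/mca/bml/bml.h. */
     struct mca_btl_base_module_t;
     typedef struct { struct mca_btl_base_module_t *btl; } mca_bml_base_btl_t;
     typedef struct {
         size_t arr_size;
         mca_bml_base_btl_t *bml_btls;
     } mca_bml_base_btl_array_t;

     /* Defensive variant of the remove loop from the gdb listing.  The
        early bail-out guards against the state seen in the core dump
        (arr_size == 69 while bml_btls == NULL). */
     static bool btl_array_remove_sketch( mca_bml_base_btl_array_t *array,
                                          struct mca_btl_base_module_t *btl )
     {
         size_t i;
         if( NULL == array->bml_btls || 0 == array->arr_size ) {
             return false;  /* stale or empty array: nothing to walk */
         }
         for( i = 0; i < array->arr_size; i++ ) {
             if( array->bml_btls[i].btl == btl ) {
                 /* shift the remaining entries back by one so the
                    found btl is "removed", staying within bounds */
                 for( ; i < array->arr_size - 1; i++ ) {
                     array->bml_btls[i] = array->bml_btls[i + 1];
                 }
                 array->arr_size--;
                 return true;
             }
         }
         return false;  /* btl was not in this array */
     }

A guard like this only masks the real problem, of course: something
freed (or never allocated) bml_btls while leaving arr_size stale, and
that is the part worth chasing.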

Anyone else seeing problems?
--tjn

  _________________________________________________________________________
   Thomas Naughton naughtont_at_[hidden]
   Research Associate (865) 576-4184

On Fri, 16 May 2014, Gilles Gouaillardet wrote:

> Folks,
>
> a simple
> mpirun -np 2 -host localhost --mca btl tcp,self mpi_helloworld
>
> crashes after some of yesterday's commits (I would blame r31778
> and/or r31782, but I am not 100% sure).
>
> /* a list receives a negative value, so the program takes some time
> before crashing; the symptom may vary from one system to another */
>
> I dug into this and found what looks like an old bug/typo in
> mca_bml_r2_del_procs().
> The bug was *not* introduced by yesterday's commits; I believe this
> path had simply never been executed before yesterday, which is why
> we only now hit the bug.
>
> I fixed this in r31786.
>
> Gilles
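
For context on where Gilles' fix lands: per the backtrace in Tom's
mail, the crash is reached through mca_bml_r2_del_proc_btl()
(bml_r2.c:551), which strips one BTL from each of the endpoint's BTL
arrays. A rough schematic of that shape, reusing the stand-in types
and btl_array_remove_sketch() from the sketch earlier in this thread
(the endpoint fields mirror bml.h; the actual bml_r2.c code and the
specific typo fixed in r31786 will differ):

     /* Stand-in endpoint: one BTL array per protocol class, as in
        mca_bml_base_endpoint_t (ompi/mca/bml/bml.h). */
     typedef struct {
         mca_bml_base_btl_array_t btl_eager;  /* BTLs for eager sends */
         mca_bml_base_btl_array_t btl_send;   /* BTLs for larger sends */
         mca_bml_base_btl_array_t btl_rdma;   /* BTLs for RDMA */
     } bml_endpoint_sketch_t;

     /* Schematic of the del_proc_btl shape from frame #1 of the
        backtrace: each call repeats the bml.h:139 array walk, so an
        endpoint left with bml_btls == NULL and a garbage arr_size
        faults as soon as one of these walks reaches it. */
     static void del_proc_btl_sketch( bml_endpoint_sketch_t *ep,
                                      struct mca_btl_base_module_t *btl )
     {
         btl_array_remove_sketch( &ep->btl_eager, btl );
         btl_array_remove_sketch( &ep->btl_send,  btl );
         btl_array_remove_sketch( &ep->btl_rdma,  btl );
     }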
_______________________________________________
devel mailing list
devel_at_[hidden]
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14819.php