Open MPI Development Mailing List Archives

From: Tim Prins (tprins_at_[hidden])
Date: 2007-07-08 12:41:58


On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > While looking into another problem I ran into an issue which made ob1
> > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > test suite, if I limit the gm freelist by too much, I get a segfault.
> > That is,
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >
> > works fine, but
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>
> I cannot, unfortunately, reproduce this with openib BTL.
>
> > segfaults. Here is the relevant output from gdb:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 1077541088 (LWP 15600)]
> > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > sizeof(mca_pml_ob1_fin_hdr_t));
>
> can you send me what's inside bml_btl?

It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong. I
fixed this in r15304. But now we hang instead of segfaulting, with both
processes just looping through opal_progress. I really don't know what to look
for. Any hints?

Thanks,

Tim
 

>
> > (gdb) bt
> > #0 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490,
> > bml_btl=0xd323580, hdr_des=0x9e54e78, order=255 '\377', status=1) at
> > pml_ob1.c:267
> > #1 0x404eef7a in mca_pml_ob1_send_request_put_frag
> > (frag=0xa711f00) at pml_ob1_sendreq.c:1141
> > #2 0x404d986e in mca_pml_ob1_process_pending_rdma () at pml_ob1.c:387
> > #3 0x404eed57 in mca_pml_ob1_put_completion (btl=0x9c37e38,
> > ep=0x9c42c78, des=0xb62ad00, status=0) at pml_ob1_sendreq.c:1108
> > #4 0x404ff520 in mca_btl_gm_put_callback (port=0x9bec5e0,
> > context=0xb62ad00, status=GM_SUCCESS) at btl_gm.c:682
> > #5 0x40512c4f in gm_handle_sent_tokens (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_handle_sent_tokens.c:82
> > #6 0x40517c73 in _gm_unknown (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_unknown.c:222
> > #7 0x405180fc in gm_unknown (p=0x9bec5e0, e=0x406189c0)
> > at ./libgm/gm_unknown.c:300
> > #8 0x40502708 in mca_btl_gm_component_progress () at
> > btl_gm_component.c:649
> > #9 0x404f6fd6 in mca_bml_r2_progress () at
> > bml_r2.c:110
> > #10 0x401a51d3 in opal_progress () at runtime/opal_progress.c:201
> > #11 0x405cf864 in opal_condition_wait (c=0x9e564b8, m=0x9e56478)
> > at ../../../../opal/threads/condition.h:98
> > #12 0x405cf68e in ompi_osc_pt2pt_module_fence (assert=0, win=0x9e55ec8)
> > at osc_pt2pt_sync.c:142
> > #13 0x400b6ebb in PMPI_Win_fence (assert=0, win=0x9e55ec8) at
> > pwin_fence.c:57
> > #14 0x0804a2f3 in test_bandwidth1 (nbufsize=1050000,
> > min_iterations=10, max_iterations=1000, verbose=0) at test_dan1.c:282
> > #15 0x0804b06f in get_bandwidth (argc=0, argv=0x0) at test_dan1.c:686
> > #16 0x080512f5 in test_dan1 () at test_dan1.c:3555
> > #17 0x08051573 in main (argc=1, argv=0xbfeba9f4) at test_dan1.c:3639
> > (gdb)
> >
> > This is using the trunk. Any ideas?
> >
> > Thanks,
> >
> > Tim
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Gleb.