Open MPI Development Mailing List Archives

From: Tim Prins (tprins_at_[hidden])
Date: 2007-07-08 12:41:58


On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > While looking into another problem, I ran into an issue that made ob1
> > segfault on me. Using gm, and running the test test_dan1 from the onesided
> > test suite, if I cap the gm free list too low, I get a segfault.
> > That is,
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >
> > works fine, but
> >
> > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>
> I cannot, unfortunately, reproduce this with the openib BTL.
>
> > segfaults. Here is the relevant output from gdb:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 1077541088 (LWP 15600)]
> > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> >     hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > 267         MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order, sizeof(mca_pml_ob1_fin_hdr_t));
>
> Can you send me what's inside bml_btl?
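
As background on the reproducer above: btl_gm_free_list_max is an ordinary
MCA parameter, so its default value and description can be checked with
ompi_info. On a build of this vintage, something like

  ompi_info --param btl gm

should list the gm BTL parameters, including the free list settings.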

It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong. I
fixed this in r15304. But now we hang instead of segfaulting, with both
processes just looping through opal_progress. I really don't know what to look
for. Any hints?
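
For reference, this kind of bug is easy to miss in C: when a call site
lists arguments in a different order than the prototype but the types
still line up, integer arguments convert silently and pointer arguments
typically draw only a warning while still handing the callee a bogus
value. A minimal sketch of the failure mode, using hypothetical names
rather than the real Open MPI signatures:

  #include <stdio.h>

  /* Hypothetical stand-ins for the real structures. */
  struct proc_t    { int id; };
  struct bml_btl_t { int index; };

  /* Hypothetical prototype: 'order' comes before 'status'. */
  static void send_fin(struct proc_t *proc, struct bml_btl_t *btl,
                       int order, int status)
  {
      printf("order=%d status=%d\n", order, status);
  }

  int main(void)
  {
      struct proc_t    proc = { 0 };
      struct bml_btl_t btl  = { 0 };

      send_fin(&proc, &btl, 255, 1);  /* intended: order=255, status=1 */
      send_fin(&proc, &btl, 1, 255);  /* swapped: compiles without any
                                         warning, but 'order' now carries
                                         the status value */
      return 0;
  }

If the swap involves one of the pointer arguments instead, the callee ends
up dereferencing a value that was never a valid address of that type,
which would be consistent with the crash inside MCA_PML_OB1_DES_ALLOC
above.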

Thanks,

Tim
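
A standard way to characterize a spin like this is to attach gdb to each
hung rank and sample the stack a few times:

  gdb -p <pid of a test_dan1 process>
  (gdb) bt
  (gdb) continue
  ... wait a moment, then Ctrl-C ...
  (gdb) bt

If every sample bottoms out in opal_progress() under the fence's condition
wait, as in frames #10 through #12 of the backtrace below, that would
point at a completion or control message (such as the FIN above) that is
never delivered or retried.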

>
> > (gdb) bt
> > #0  0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> >     hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > #1  0x404eef7a in mca_pml_ob1_send_request_put_frag (frag=0xa711f00)
> >     at pml_ob1_sendreq.c:1141
> > #2  0x404d986e in mca_pml_ob1_process_pending_rdma () at pml_ob1.c:387
> > #3  0x404eed57 in mca_pml_ob1_put_completion (btl=0x9c37e38, ep=0x9c42c78,
> >     des=0xb62ad00, status=0) at pml_ob1_sendreq.c:1108
> > #4  0x404ff520 in mca_btl_gm_put_callback (port=0x9bec5e0, context=0xb62ad00,
> >     status=GM_SUCCESS) at btl_gm.c:682
> > #5  0x40512c4f in gm_handle_sent_tokens (p=0x9bec5e0, e=0x406189c0)
> >     at ./libgm/gm_handle_sent_tokens.c:82
> > #6  0x40517c73 in _gm_unknown (p=0x9bec5e0, e=0x406189c0)
> >     at ./libgm/gm_unknown.c:222
> > #7  0x405180fc in gm_unknown (p=0x9bec5e0, e=0x406189c0)
> >     at ./libgm/gm_unknown.c:300
> > #8  0x40502708 in mca_btl_gm_component_progress () at btl_gm_component.c:649
> > #9  0x404f6fd6 in mca_bml_r2_progress () at bml_r2.c:110
> > #10 0x401a51d3 in opal_progress () at runtime/opal_progress.c:201
> > #11 0x405cf864 in opal_condition_wait (c=0x9e564b8, m=0x9e56478)
> >     at ../../../../opal/threads/condition.h:98
> > #12 0x405cf68e in ompi_osc_pt2pt_module_fence (assert=0, win=0x9e55ec8)
> >     at osc_pt2pt_sync.c:142
> > #13 0x400b6ebb in PMPI_Win_fence (assert=0, win=0x9e55ec8) at pwin_fence.c:57
> > #14 0x0804a2f3 in test_bandwidth1 (nbufsize=1050000, min_iterations=10,
> >     max_iterations=1000, verbose=0) at test_dan1.c:282
> > #15 0x0804b06f in get_bandwidth (argc=0, argv=0x0) at test_dan1.c:686
> > #16 0x080512f5 in test_dan1 () at test_dan1.c:3555
> > #17 0x08051573 in main (argc=1, argv=0xbfeba9f4) at test_dan1.c:3639
> > (gdb)
> >
> > This is using the trunk. Any ideas?
> >
> > Thanks,
> >
> > Tim
>
> --
> Gleb.