
Open MPI Development Mailing List Archives


From: Tim Prins (tprins_at_[hidden])
Date: 2007-07-09 17:56:45


On Monday 09 July 2007 02:04:33 pm Gleb Natapov wrote:
> On Mon, Jul 09, 2007 at 10:41:52AM -0400, Tim Prins wrote:
> > Gleb Natapov wrote:
> > > On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> > >> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> > >>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > >>>> While looking into another problem I ran into an issue that made
> > >>>> ob1 segfault on me. Using gm and running the test test_dan1 in the
> > >>>> onesided test suite, if I limit the gm free list too much, I get a
> > >>>> segfault. That is,
> > >>>>
> > >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024
> > >>>> test_dan1
> > >>>>
> > >>>> works fine, but
> > >>>>
> > >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512
> > >>>> test_dan1
> > >>>
> > >>> I cannot, unfortunately, reproduce this with the openib BTL.
> > >>>
> > >>>> segfaults. Here is the relevant output from gdb:
> > >>>>
> > >>>> Program received signal SIGSEGV, Segmentation fault.
> > >>>> [Switching to Thread 1077541088 (LWP 15600)]
> > >>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490,
> > >>>> bml_btl=0xd323580, hdr_des=0x9e54e78, order=255 '\377', status=1)
> > >>>> at pml_ob1.c:267
> > >>>> 267         MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > >>>>             sizeof(mca_pml_ob1_fin_hdr_t));
> > >>>
> > >>> can you send me what's inside bml_btl?
> > >>
> > >> It turns out that the order of arguments to mca_pml_ob1_send_fin was
> > >> wrong. I fixed this in r15304. But now we hang instead of segfaulting,
> > >> and both processes just loop through opal_progress. I really don't
> > >> know what to look for. Any hints?
> > >
> > > Can you look in gdb at mca_pml_ob1.rdma_pending?
> >
> > Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.
>
> Do you run both ranks on the same node? Can you try to run them on
> different nodes?
>
I was running on one node, but running on different nodes leads to the same
result.
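
For reference, a gdb session along the following lines can be used to poke at
the pending-RDMA list discussed above, by attaching to one of the hung ranks.
This is only a sketch: mca_pml_ob1.rdma_pending is the list named earlier in
the thread, but the opal_list_length field name is an assumption about
opal_list_t, not output captured from this run.

  # find the pid of a hung rank, then attach
  gdb -p <pid>
  (gdb) print mca_pml_ob1.rdma_pending
  (gdb) print mca_pml_ob1.rdma_pending.opal_list_length   # items still queued
  (gdb) bt                                                # where opal_progress is spinning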

Tim
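
As an aside on the argument-order bug Tim mentions (fixed in r15304): a call
whose arguments are passed in an order that does not match the callee's
prototype can still compile, and only fails once the callee dereferences the
misplaced pointer, which is consistent with the crash inside
MCA_PML_OB1_DES_ALLOC above. The fragment below is a hypothetical,
self-contained illustration of that failure mode, not the actual ob1 code.

  #include <stdio.h>
  #include <stdint.h>

  struct btl  { const char *name; };   /* stand-in for a BML/BTL object  */
  struct desc { size_t      len;  };   /* stand-in for a send descriptor */

  /* The callee expects (btl, descriptor, order). */
  static void send_fin(struct btl *b, struct desc *d, uint8_t order)
  {
      /* Dereferencing b is only safe if the caller kept the order right. */
      printf("fin via %s, %zu bytes, order %u\n", b->name, d->len, order);
  }

  int main(void)
  {
      struct btl  b = { "gm" };
      struct desc d = { 64 };

      send_fin(&b, &d, 255);   /* correct order: prints as expected */

      /* Swapped call: with casts (or a stale declaration) this still
       * compiles, but b->name now reads from a struct desc and the
       * program crashes or prints garbage, much like the segfault seen
       * when bml_btl held the wrong pointer.
       */
      /* send_fin((struct btl *)&d, (struct desc *)&b, 255); */
      return 0;
  }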