Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM initialization race condition
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2008-08-21 10:04:51


George Bosilca wrote:
> Terry,
>
> We use the feature defined by POSIX mmap where the area should be
> zero-filled when the file length is extended. Which OS are you using
> when you see such problems?
>
So far I've only tested this on Solaris. We'll try out the bzero to see
if this goes away.
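
For anyone who wants to check the zero-fill behaviour on a particular OS, here is a minimal sketch. It is not Open MPI code: the helper name fresh_mapping_is_zeroed, the /tmp path template and the force_zero flag are all made up for illustration. As far as I know, POSIX specifies the zero-fill for ftruncate() when a file is extended, so the sketch grows a new file that way, maps it MAP_SHARED and scans the region. With force_zero set it does an explicit memset first, which is the portable equivalent of the bzero we are going to try.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Open MPI).
 * Creates a temporary file, extends it to `len` bytes, maps it shared,
 * and reports whether every byte of the mapping reads back as zero.
 * If `force_zero` is non-zero the region is cleared explicitly first.
 * Returns 1 if the region is all zeros, 0 if not, -1 on any error.
 */
static int fresh_mapping_is_zeroed(size_t len, int force_zero)
{
    char path[] = "/tmp/sm_zero_check_XXXXXX"; /* assumed scratch location */
    unsigned char *addr;
    size_t i;
    int fd;
    int zeroed = 1;

    assert(len > 0);

    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path); /* the file disappears once the descriptor is closed */

    /* Extend the empty file; the new area should appear zero-filled. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (addr == MAP_FAILED) {
        return -1;
    }

    if (force_zero) {
        memset(addr, 0, len); /* explicit clear, like the proposed bzero */
    }

    for (i = 0; i < len; i++) {
        if (addr[i] != 0) {
            zeroed = 0;
            break;
        }
    }

    munmap(addr, len);
    return zeroed;
}
```

Calling fresh_mapping_is_zeroed(4096, 0) tests what the OS hands back, and fresh_mapping_is_zeroed(4096, 1) tests the explicit clear. A single-process check like this can only confirm the zero-fill itself; it cannot show a race between ranks attaching to the same segment.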

--td
> Just in case, here is a patch that sets the beginning of the mmapped
> region to zero, in case this is not done automatically. As in most
> cases this is an unnecessary overhead, we should find the cases where
> we really need this, and possibly conditionally compile it.
>
> Index: ompi/mca/common/sm/common_sm_mmap.c
> ===================================================================
> --- ompi/mca/common/sm/common_sm_mmap.c (revision 19377)
> +++ ompi/mca/common/sm/common_sm_mmap.c (working copy)
> @@ -163,6 +163,7 @@
>
>          /* initialize the segment - only the first process to open the file */
> +        bzero( map->data_addr, size );
>          mem_offset = map->data_addr - (unsigned char *)map->map_seg;
>          map->map_seg->seg_offset = mem_offset;
>          map->map_seg->seg_size = size - mem_offset;
>
> george.
>
> On Aug 21, 2008, at 1:22 PM, Terry Dontje wrote:
>
>> I've been seeing an intermittent segv (once every 4 hours when looping
>> on a quick initialization program) with the following stack trace.
>>
>> =>[1] mca_btl_sm_add_procs(btl = 0xfffffd7ffdb67ef0, nprocs = 2U,
>> procs = 0x591560, peers = 0x591580, reachability =
>> 0xfffffd7fffdff000), line 519 in "btl_sm.c"
>> [2] mca_bml_r2_add_procs(nprocs = 2U, procs = 0x591560, bml_endpoints
>> = 0x591500, reachable = 0xfffffd7fffdff000), line 222 in "bml_r2.c"
>> [3] mca_pml_ob1_add_procs(procs = 0x5914c0, nprocs = 2U), line 248 in
>> "pml_ob1.c"
>> [4] ompi_mpi_init(argc = 1, argv = 0xfffffd7fffdff318, requested = 0,
>> provided = 0xfffffd7fffdff234), line 651 in "ompi_mpi_init.c"
>> [5] PMPI_Init(argc = 0xfffffd7fffdff2ec, argv = 0xfffffd7fffdff2e0),
>> line 90 in "pinit.c"
>> [6] main(argc = 1, argv = 0xfffffd7fffdff318), line 82 in "buffer.c"
>>
>> I believe the problem is that mca_btl_sm_component.shm_fifo[j]
>> contains uninitialized data, which causes the loop on line 504 in
>> btl_sm.c to think that a remote rank has set its fifo address.
>>
>> Has anyone else seen the above happening?
>>
>> --td
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel