Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM initialization race condition
From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2008-08-21 09:36:11


bzero is not a gnu-ism -- it's in POSIX.1. Either bzero or memset is
correct and used throughout OMPI.

Brian

On Thu, 21 Aug 2008, Jeff Squyres wrote:

> IIRC, bzero is a gnu-ism. We should probably use memset instead.
>
>
> On Aug 21, 2008, at 5:40 AM, George Bosilca wrote:
>
>> Terry,
>>
>> We use the feature defined by POSIX mmap where the area should be
>> zero-filled when the file length is extended. What OS are you using when you
>> see such problems?
>>
>> Just in case, here is a patch that sets the beginning of the mmaped region
>> to zero, in case this is not done automatically. As in most cases this is
>> an unnecessary overhead, we should find the cases where we really need
>> this, and possibly conditionally compile it.
>>
>> Index: ompi/mca/common/sm/common_sm_mmap.c
>> ===================================================================
>> --- ompi/mca/common/sm/common_sm_mmap.c (revision 19377)
>> +++ ompi/mca/common/sm/common_sm_mmap.c (working copy)
>> @@ -163,6 +163,7 @@
>>
>> /* initialize the segment - only the first process
>> to open the file */
>> + bzero( map->data_addr, size );
>> mem_offset = map->data_addr - (unsigned char *)map->map_seg;
>> map->map_seg->seg_offset = mem_offset;
>> map->map_seg->seg_size = size - mem_offset;
>>
>> george.
>>
>> On Aug 21, 2008, at 1:22 PM, Terry Dontje wrote:
>>
>>> I've been seeing an intermittent (once every 4 hours looping on a quick
>>> initialization program) segv with the following stack trace.
>>>
>>> =>[1] mca_btl_sm_add_procs(btl = 0xfffffd7ffdb67ef0, nprocs = 2U, procs =
>>> 0x591560, peers = 0x591580, reachability = 0xfffffd7fffdff000), line 519
>>> in "btl_sm.c"
>>> [2] mca_bml_r2_add_procs(nprocs = 2U, procs = 0x591560, bml_endpoints =
>>> 0x591500, reachable = 0xfffffd7fffdff000), line 222 in "bml_r2.c"
>>> [3] mca_pml_ob1_add_procs(procs = 0x5914c0, nprocs = 2U), line 248 in
>>> "pml_ob1.c"
>>> [4] ompi_mpi_init(argc = 1, argv = 0xfffffd7fffdff318, requested = 0,
>>> provided = 0xfffffd7fffdff234), line 651 in "ompi_mpi_init.c"
>>> [5] PMPI_Init(argc = 0xfffffd7fffdff2ec, argv = 0xfffffd7fffdff2e0), line
>>> 90 in "pinit.c"
>>> [6] main(argc = 1, argv = 0xfffffd7fffdff318), line 82 in "buffer.c"
>>>
>>> I believe the problem is that mca_btl_sm_component.shm_fifo[j] contains
>>> uninitialized data, which causes the loop on line 504 in btl_sm.c to think
>>> that a remote rank has set its fifo address.
>>>
>>> Has anyone else seen the above happening?
>>>
>>> --td
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
>
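
For readers following the thread, here is a minimal sketch of the pattern under discussion: extend a backing file, map it shared, and zero the region explicitly instead of relying on the zero fill that POSIX specifies for an extended file. It is not the code in common_sm_mmap.c. The helper names create_shared_segment and count_nonzero and the /tmp path template are assumptions made for illustration. The sketch uses memset, which is ISO C; bzero was marked LEGACY in POSIX.1-2001 and removed in POSIX.1-2008.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function): create a temporary
 * backing file, extend it to `size` bytes, and map it MAP_SHARED.
 * Returns NULL on failure. */
static unsigned char *create_shared_segment(size_t size)
{
    char path[] = "/tmp/sm_sketch_XXXXXX";   /* assumed location */
    void *addr;
    int fd;

    assert(size > 0);

    fd = mkstemp(path);
    if (fd < 0) {
        return NULL;
    }
    unlink(path);   /* the file persists until the fd and mapping are gone */

    /* Extending the file: POSIX says the new bytes read as zero. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);      /* the mapping stays valid after close */
    if (addr == MAP_FAILED) {
        return NULL;
    }

    /* Defensive zeroing, in the spirit of the patch above. Only the
     * process that creates the segment should do this, and it must
     * finish before any peer reads the region. */
    memset(addr, 0, size);
    return (unsigned char *)addr;
}

/* Count bytes that are not zero; a freshly created segment should
 * report 0. */
static size_t count_nonzero(const unsigned char *p, size_t n)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        if (p[i] != 0) {
            count++;
        }
    }
    return count;
}
```

A caller would map a segment with create_shared_segment(4096), confirm that count_nonzero reports 0 before publishing any fifo addresses into it, and release it with munmap. Zeroing only helps with the race Terry describes if peers cannot observe the segment before the memset completes; that ordering is outside the scope of this sketch.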