Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fixing SPARC bus errors
From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2008-01-04 13:49:01


Hello George:

The change on the sm (shared memory) side may look unnecessary at
first, but it handles a bus error on the sending side, not on the
receiving side.

The change to mca_btl_sm_hdr_t is necessary because of the way the
pml and btl headers are stored in shared memory: the pml header
follows the btl header, and in some cases the pml header contains a
uint64_t. If mca_btl_sm_hdr_t is 12 bytes in size, the pml header does
not start on a double-word aligned boundary. When the pml header is a
mca_pml_ob1_rendezvous_hdr_t, we get a bus error while storing to
hdr_msg_length. Here is one example, although it can happen in other
places as well. (Line numbers are close to what is in the trunk, give
or take a few lines.)

program terminated by signal BUS (invalid address alignment)
Current function is mca_pml_ob1_send_request_start_rndv (optimized)
   743 hdr->hdr_rndv.hdr_msg_length = sendreq->req_send.req_bytes_packed;
  (dbx) print &(hdr->hdr_rndv.hdr_msg_length)
&hdr->hdr_rndv.hdr_msg_length = 0xf4d1e81c
  (dbx) where
=>[1] mca_pml_ob1_send_request_start_rndv() (optimized),
           at 0xfd5f76b8 (line ~743) in "pml_ob1_sendreq.c"
   [2] mca_pml_ob1_send_request_start() (optimized),
           at 0xfd5d013c (line ~388) in "pml_ob1_sendreq.h"
   [3] mca_pml_ob1_send() (optimized), at 0xfd5d1544 (line ~117) in
"pml_ob1_isend.c"
   [4] PMPI_Send(), at 0xfedd7204 (line ~65) in "psend.c"
   [5] main(0xffbfed40, 0xfffffff8, 0x2, 0x0, 0x7d1, 0x7d0), at 0x125bc
(dbx)
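
To make the arithmetic concrete, here is a minimal C sketch under
stated assumptions. The two structs are hypothetical stand-ins for a
32-bit build (4-byte pointer, 4-byte size_t, 1-byte tag) and are not
the real mca_btl_sm_hdr_t. The helpers misalignment(), store_u64() and
load_u64() are illustrative names that do not exist in Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: simplified 32-bit layouts, not the real Open MPI structs. */
struct sm_hdr_unpadded {
    uint32_t frag;    /* stands in for a 32-bit pointer */
    uint32_t len;     /* stands in for a 32-bit size_t */
    uint8_t  tag;     /* stands in for a 1-byte tag */
};                    /* 9 bytes of fields, rounded up to 12 */

struct sm_hdr_padded {
    uint32_t frag;
    uint32_t len;
    uint8_t  tag;
    char     pad[4];  /* the 4-byte pad from the proposed change */
};                    /* 13 bytes of fields, rounded up to 16 */

/* Distance of `addr` (or a size) from the previous `align`-byte boundary.
 * `align` must be a nonzero power of two. */
static size_t misalignment(uint64_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size_t)(addr & (uint64_t)(align - 1));
}

/* Store a 64-bit value with memcpy, so no aligned 8-byte store is
 * required at `dst`. */
static void store_u64(void *dst, uint64_t value)
{
    memcpy(dst, &value, sizeof value);
}

/* Load a 64-bit value with memcpy, so no aligned 8-byte load is
 * required at `src`. */
static uint64_t load_u64(const void *src)
{
    uint64_t value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

With these stand-ins, misalignment(sizeof(struct sm_hdr_unpadded), 8)
is 4, so anything placed right after the 12-byte header sits 4 bytes
off a double-word boundary. That matches the low bits of the address
dbx printed above (0xf4d1e81c). The padded variant is 16 bytes, which
keeps the following data double-word aligned. store_u64() and
load_u64() show the other way around the problem: going through memcpy
instead of a direct uint64_t access.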

George Bosilca wrote:
> Rolf,
>
> If we memcpy instead of assigning the header in the OB1 PML why do we
> need the padding in the frag header ?
>
> Thanks,
> george.
>
> On Jan 3, 2008, at 2:47 PM, Rolf vandeVaart wrote:
>
>>
>> Greetings. We have seen some bus errors when compiling a user
>> application with certain compiler flags and running it on a
>> SPARC-based server. The issue is that some structures are not word
>> or double-word aligned, which causes a bus error. I have tracked
>> down two places where I can make a minor change and everything seems
>> to work fine. However, I want to see if anyone has issues with these
>> changes. The two changes are shown below.
>>
>> burl-ct-v440-0 206 =>svn diff
>> Index: ompi/mca/btl/sm/btl_sm_frag.h
>> ===================================================================
>> --- ompi/mca/btl/sm/btl_sm_frag.h (revision 17039)
>> +++ ompi/mca/btl/sm/btl_sm_frag.h (working copy)
>> @@ -9,6 +9,7 @@
>> * University of Stuttgart. All rights reserved.
>> * Copyright (c) 2004-2005 The Regents of the University of California.
>> * All rights reserved.
>> + * Copyright (c) 2008 Sun Microsystems, Inc. All rights reserved.
>> * $COPYRIGHT$
>> *
>> * Additional copyrights may follow
>> @@ -41,6 +42,10 @@
>> struct mca_btl_sm_frag_t *frag;
>> size_t len;
>> mca_btl_base_tag_t tag;
>> + /* Add a 4 byte pad to round out structure to 16 bytes for 32-bit
>> + * and to 24 bytes for 64-bit. Helps prevent bus errors for strict
>> + * alignment cases like SPARC. */
>> + char pad[4];
>> };
>> typedef struct mca_btl_sm_hdr_t mca_btl_sm_hdr_t;
>>
>>
>> Index: ompi/mca/pml/ob1/pml_ob1_recvfrag.h
>> ===================================================================
>> --- ompi/mca/pml/ob1/pml_ob1_recvfrag.h (revision 17039)
>> +++ ompi/mca/pml/ob1/pml_ob1_recvfrag.h (working copy)
>> @@ -9,6 +9,7 @@
>> * University of Stuttgart. All rights reserved.
>> * Copyright (c) 2004-2005 The Regents of the University of California.
>> * All rights reserved.
>> + * Copyright (c) 2008 Sun Microsystems, Inc. All rights reserved.
>> * $COPYRIGHT$
>> *
>> * Additional copyrights may follow
>> @@ -67,7 +68,8 @@
>> unsigned char* _ptr = (unsigned char*)frag->addr; \
>> /* init recv_frag */ \
>> frag->btl = btl; \
>> - frag->hdr = *(mca_pml_ob1_hdr_t*)hdr; \
>> + memcpy(&frag->hdr, (void *)((mca_pml_ob1_hdr_t*)hdr), \
>> + sizeof(mca_pml_ob1_hdr_t)); \
>> frag->num_segments = 1; \
>> _size = segs[0].seg_len; \
>> for( i = 1; i < cnt; i++ ) { \
>> burl-ct-v440-0 207 =>
>>
>>
>> The ticket associated with this issue is
>> https://svn.open-mpi.org/trac/ompi/ticket/1148
>>
>> Rolf
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
=========================
rolf.vandevaart_at_[hidden]
781-442-3043
=========================