Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] 1.7.4rc: mpirun hangs on ia64
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2014-01-23 16:16:26


Some progress:

I fixed IA64.asm but still saw failures.
I realized I'd not checked the ia64/atomic.h file.

Lo and behold the origin of the bogus "sxt4" is a pair of improper casts,
removed by the following:
--- opal/include/opal/sys/ia64/atomic.h~ 2014-01-23
13:04:03.000000000 -0800
+++ opal/include/opal/sys/ia64/atomic.h 2014-01-23 13:04:42.000000000 -0800
@@ -119,7 +119,7 @@
     __asm__ __volatile__ ("cmpxchg8.acq %0=[%1],%2,ar.ccv":
                   "=r"(ret) : "r"(addr), "r"(newval) : "memory");

- return ((int32_t)ret == oldval);
+ return (ret == oldval);
 }

@@ -132,7 +132,7 @@
     __asm__ __volatile__ ("cmpxchg8.rel %0=[%1],%2,ar.ccv":
                   "=r"(ret) : "r"(addr), "r"(newval) : "memory");

- return ((int32_t)ret == oldval);
+ return (ret == oldval);
 }

 #endif /* OMPI_GCC_INLINE_ASSEMBLY */

I will retest ASAP and report with, I hope, an attachment to fix both
IA64.asm and ia64/atomic.h

-Paul

On Wed, Jan 22, 2014 at 4:24 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

>
> On Wed, Jan 22, 2014 at 2:22 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> My ia64 asm is a bit rusty, but I'll give a quick look if/when I can.
>
>
> I had a look (in v1.7) and this is what I see:
>
> $cat -n IA64.asm | grep -A14 opal_atomic_cmpset_acq_64:
> 70 opal_atomic_cmpset_acq_64:
> 71 .prologue
> 72 .body
> 73 mov ar.ccv=r33;;
> 74 cmpxchg8.acq r32=[r32],r34,ar.ccv
> 75 ;;
> 76 sxt4 r32 = r32
> 77 ;;
> 78 cmp.eq p6, p7 = r33, r32
> 79 ;;
> 80 (p6) addl r8 = 1, r0
> 81 (p7) mov r8 = r0
> 82 br.ret.sptk.many b0
> 83 ;;
> 84 .endp opal_atomic_cmpset_acq_64#
>
> The (approximate and non-atomic) C equivalent is:
>
> // r32 = address
> // r33 = oldvalue
> // r34 = newvalue
> int opal_atomic_cmpset_acq_64(int64_t r32, int64_t r33, int64 r34) {
> int64_t ccv = r33; // L73
> if (*(int64_t *)r32 == ccv) *(int64_t *)r32 = r34; // L74
>
> r32 = (int64_t)(int32_t)r32; // L76 = sign-extend 32->64
>
> bool p6, p7;
> p7 = !(p6 = (r33 == r32)); // L78
>
> const int r0 = 0;
> int r8;
> if (p6) r8 = 1 + r0; // L80
> if (p7) r8 = r0; // L81
> return r8; // L82
> }
>
> Which is fine except that line 76 is totally wrong!!
> The "sxt4" instruction is "sign-extend from 4 bytes to 8 bytes".
> Thus the upper 32-bits of the value read from memory are lost!
> Unless the upper 33 bits off r33 (oldvalue) are all 0s or all 1s, the
> comparison on line 78 MUST fail.
> This explains the hang, as the lifo push will loop indefinitely waiting
> for the success of this cmpset.
>
> Note the same erroneous instruction is also present in the _rel variant
> (at line 94).
> The trunk has the same issue.
> This code has not changed at all since IA64.asm was added way back in
> r4471.
>
> I won't have access to the IA64 platform again until tomorrow AM.
> So, testing my hypothesis will need to wait.
>
> BTW:
> IFF I am right about the source of this problem, then it would be
> beneficial to have (and I may contribute) a stronger test (for "make
> check") that would detect this sort of bug in the atomics (specifically
> look for both false-positive and false-negative return value from 64-bit
> cmpset operations with values satisfying a range of "corner cases"). I
> think I have single-bit and double-bit "marching tests" for cmpset in my
> own arsenal of tests for GASNet's atomics. If I don't have time to
> contribute a complete test, I can at least contribute that logic for
> somebody else to port to the OPAL atomics.
>
> -Paul
>
> P.S.:
> The cmpxchgN for N in 1,2,4 are documented as ZERO-extending their loads
> to 64-bits.
> So, there is a slim chance that the sxt4 actually was intended for the
> 32-bit cmpset code.
> However, since the comparison used there is a "cmp4.eq" the "sxt4" would
> still not be needed.
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900