On Dec 6, 2006, at 2:29 PM, Brock Palen wrote:
>> I wonder if we can narrow this down a bit to perhaps a PML protocol
>> Start by disabling RDMA by using:
>> -mca btl_gm_flags 1
> On the other hand, with OB1, using btl_gm_flags 1 fixed the error
> problem with OMPI! Which is a great first step.
> mpirun -np 4 --mca btl_gm_flags 1 ./xhpl
> That allowed HPL to run with no errors. I verified that the performance
> was better than when run without gm
> (added --mca btl ^gm).
> So there is still a problem with DR, which I don't need, but I'm
> willing to help test it.
> Can we look into why leaving RDMA on is causing a problem?
Brock and Galen,
We are willing to assist. Our best guess is that OMPI is using the
code in a way different than MPICH-GM does. One of our other
developers who is more comfortable with the GM API is looking into it.
Testing with HPCC, in addition to the failed HPL residuals, I am also
seeing these messages:
: ERROR: from right: expected 2 and 3 as first and last byte, but got 2 and 5 instead
: ERROR: from right: expected 3 and 4 as first and last byte, but got 3 and 7 instead
: ERROR: from right: expected 4 and 5 as first and last byte, but got 4 and 3 instead
: ERROR: from right: expected 7 and 8 as first and last byte, but got 7 and 5 instead
which is from $HPCC/src/bench_lat_bw_1.5.2.c.
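
For anyone unfamiliar with this kind of check, here is a minimal Python sketch of a first-and-last-byte test that produces messages in the same format. It is not the code from bench_lat_bw_1.5.2.c: the names make_message and check_message, the 8-byte length, and the zero fill are all assumptions made for illustration. In the four messages above the first byte always matches and only the last byte differs, so the sketch corrupts only the tail of the buffer.

```python
def make_message(length: int, first: int, last: int) -> bytes:
    """Build a zero-filled message whose first and last bytes are markers.

    Hypothetical helper; the real benchmark may fill the buffer differently.
    """
    if length < 2:
        raise ValueError("need at least two bytes to hold both markers")
    buf = bytearray(length)
    buf[0] = first
    buf[-1] = last
    return bytes(buf)


def check_message(buf: bytes, expected_first: int, expected_last: int,
                  side: str = "right"):
    """Return None if both markers match, else an error string."""
    first, last = buf[0], buf[-1]
    if first != expected_first or last != expected_last:
        return (f"ERROR: from {side}: expected {expected_first} and "
                f"{expected_last} as first and last byte, but got "
                f"{first} and {last} instead")
    return None


# An intact message passes the check.
intact = make_message(8, first=2, last=3)
print(check_message(intact, 2, 3))  # prints: None

# Simulate a receive where the head arrived but the tail holds other data.
damaged = bytearray(intact)
damaged[-1] = 5
print(check_message(bytes(damaged), 2, 3))
# prints: ERROR: from right: expected 2 and 3 as first and last byte, but got 2 and 5 instead
```

A check of this shape only inspects the two ends of the buffer, so it says nothing about the bytes in between.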