Using DR was suggested to see whether it could find an error. The
original problem occurred with OB1, where HPL reported failed
residuals. The hope was that DR would pinpoint the problem. It did
not: under DR, HPL made no progress at all (the GM counters
incremented, but no test completed, whether passing or failing).

With the btl_gm_min_rdma_size parameter set, OB1 now completes without
failed residuals in HPL.

This parameter sets the threshold at which the BTL starts fragmenting
RDMA transfers (not the threshold at which it starts using RDMA), per
$OMPI/ompi/mca/btl/btl.h:

    size_t btl_min_rdma_size; /**< threshold below which the BTL should not fragment */
    size_t btl_max_rdma_size; /**< maximum rdma fragment size supported by the BTL */

We believe it is the fragmenting of RDMAs on OSX that is causing the
issue. It does not happen on x86 or x86_64.
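
For reference, here is a minimal sketch of the setting in question, taken from the mpirun command quoted further down the thread. The variable name MIN_RDMA_SIZE is only for illustration; the hosts file and the hpcc.ompi.gm binary are the ones from that message, and the run itself is left commented out.

```shell
# Threshold in bytes: 10 MiB, the value used in the quoted mpirun command.
# MIN_RDMA_SIZE is a hypothetical variable name, not an Open MPI setting.
MIN_RDMA_SIZE=$((10*1024*1024))
echo "btl_gm_min_rdma_size=${MIN_RDMA_SIZE}"   # prints btl_gm_min_rdma_size=10485760

# The run as reported in the thread (exclude the tcp BTL, raise the threshold):
# mpirun -np 4 -machinefile hosts -mca btl ^tcp \
#     -mca btl_gm_min_rdma_size "${MIN_RDMA_SIZE}" ./hpcc.ompi.gm
```

With the threshold at 10 MiB, transfers smaller than that should not be fragmented, which is presumably why HPL passes with this setting.
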
On Dec 7, 2006, at 2:20 PM, George Bosilca wrote:
> Something is not clear to me in this discussion. Sometimes the
> subject was the DR PML and sometimes the OB1 PML. In fact I'm
> completely in the dark ... Which PML fails the HPCC test on the Mac?
> When I look at the command line it looks like it should be OB1, not
> DR ...
> On Dec 7, 2006, at 1:59 PM, Brock Palen wrote:
>> That is wonderful; that fixes the observed problem for running with
>> OB1. Has a bug been filed to get RDMA working on Macs?
>> The only working MPI lib is MPICH-GM, as this problem happens with
>> LAM-7.1.3 also.
>> So on track for one bug.
>> Would the person working on the DR PML like me to try any more tests?
>> Brock Palen
>> Center for Advanced Computing
>> On Dec 7, 2006, at 9:50 AM, Scott Atchley wrote:
>>> On Dec 6, 2006, at 3:09 PM, Scott Atchley wrote:
>>>> Brock and Galen,
>>>> We are willing to assist. Our best guess is that OMPI is using the
>>>> code in a way different than MPICH-GM does. One of our other
>>>> developers who is more comfortable with the GM API is looking into
>>> We tried running HPCC with:
>>> $ mpirun -np 4 -machinefile hosts -mca btl ^tcp \
>>>     -mca btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm
>>> and HPL passes. The problem seems to be in the RDMA fragmenting code
>>> on OSX. The boundary values at the edges of the fragments are not
>>> users mailing list