On Dec 7, 2006, at 2:45 PM, Brock Palen wrote:
>>>> $ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca
>>>> btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm
>>>> and HPL passes. The problem seems to be in the RDMA fragmenting
>>>> on OSX. The boundary values at the edges of the fragments are not
Here it looks like the OB1 PML was used. To get HPL to complete
successfully we need to set btl_gm_min_rdma_size to 10MB. What I
suspect is that 10MB is larger than any message HPL exchanges, so
adding this MCA parameter effectively disables the RDMA protocol
for GM.
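
To make the suspicion concrete, here is a rough sketch of the threshold
arithmetic. The uses_rdma helper and the message sizes are made up for
illustration and are not Open MPI code; it also assumes that messages at
or above the threshold take the RDMA path, which I have not checked
against the GM BTL source.

```shell
#!/bin/sh
# The threshold from the command line above: 10 MB expressed in bytes.
min_rdma_size=$((10 * 1024 * 1024))
echo "btl_gm_min_rdma_size = $min_rdma_size bytes"

# Hypothetical helper (not part of Open MPI): prints "yes" if a message
# of the given size would be at or above the threshold, "no" otherwise.
uses_rdma() {
    if [ "$1" -ge "$min_rdma_size" ]; then
        echo yes
    else
        echo no
    fi
}

# Illustrative message sizes only; these are not measured from HPL.
for size in 65536 1048576 8388608 10485760; do
    echo "$size bytes -> RDMA: $(uses_rdma "$size")"
done
```

If every message HPL sends stays below the threshold, none of them would
go through the RDMA fragmenting path, which would explain why the run
passes with this setting.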
This seems to point to a more complex problem which might not be
related to the PML. If both PMLs (OB1 and DR) have a similar problem
when running on top of the GM BTL, it might indicate that the problem
is down in the GM BTL. Can you confirm that HPL fails when running
with OB1 and GM on this particular cluster?