On Dec 7, 2006, at 3:14 PM, George Bosilca wrote:
> On Dec 7, 2006, at 2:45 PM, Brock Palen wrote:
>>>>> $ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca
>>>>> btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm
>>>>> and HPL passes. The problem seems to be in the RDMA fragmenting
>>>>> on OSX. The boundary values at the edges of the fragments are not
> Here it looks like the OB1 PML was used. In order to get HPL to
> complete successfully we need to set btl_gm_min_rdma_size to
> 10MB. What I suspect is that 10MB is more than the size of any
> message HPL exchanges, so adding this MCA parameter effectively
> disables the RDMA protocol for GM.
> This seems to pinpoint a more complex problem which might not be
> related to the PML. If both PMLs (OB1 and DR) have a similar problem
> when running on top of the GM BTL, it might indicate the problem is
> down in the GM BTL. Can you confirm that HPL fails when running with
> OB1 and GM on this particular cluster?
Without modifying btl_gm_min_rdma_size, the run fails with bad
results when using OB1.
If btl_gm_min_rdma_size is modified (which, as you pointed out,
basically disables RDMA), it no longer fails.
Using DR over ethernet (--mca btl ^gm) or over gm (with or
without btl_gm_min_rdma_size modified) does not even start up
(nothing on stdout or stderr, and it never exits).
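
To keep the four cases straight, here is a rough sketch that only prints
the command lines and does not run anything. The "-mca pml ob1" and
"-mca pml dr" selections and the use of "-mca btl ^tcp" for the DR-over-gm
case are my assumptions about how the runs were set up. The rest is taken
from the command quoted above.

```shell
#!/bin/sh
# Print (do not run) the four configurations discussed in this thread.
# Nothing in this script invokes mpirun.

# 10 MB in bytes, the same arithmetic as in the quoted command line.
MIN_RDMA=$((10*1024*1024))

# Common pieces, copied from the quoted invocation.
BASE="mpirun -np 4 -machinefile hosts"
BIN="./hpcc.ompi.gm"

# Assumption: the PML is chosen explicitly with "-mca pml <name>".
OB1_DEFAULT="$BASE -mca pml ob1 -mca btl ^tcp $BIN"
OB1_NO_RDMA="$BASE -mca pml ob1 -mca btl ^tcp -mca btl_gm_min_rdma_size $MIN_RDMA $BIN"
DR_ETHERNET="$BASE -mca pml dr -mca btl ^gm $BIN"
# Assumption: DR over gm excludes tcp the same way the OB1 runs do.
DR_GM="$BASE -mca pml dr -mca btl ^tcp $BIN"

echo "OB1 over gm, default threshold (bad results):"
echo "  $OB1_DEFAULT"
echo "OB1 over gm, threshold raised to $MIN_RDMA bytes (passes):"
echo "  $OB1_NO_RDMA"
echo "DR over ethernet (hangs, no output):"
echo "  $DR_ETHERNET"
echo "DR over gm (hangs, no output):"
echo "  $DR_GM"
```

The threshold works out to 10485760 bytes. If George's guess is right that
no HPL message is larger than that, then the second command never takes
the RDMA path at all.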
Yes, there is a problem at the BTL level. But because the DR problem is
different and persists across both GM and TCP, I believe we are looking at
two separate issues. But I am not the person to make that call.