
Open MPI User's Mailing List Archives


From: Brock Palen (brockp_at_[hidden])
Date: 2006-12-07 15:33:02


On Dec 7, 2006, at 3:14 PM, George Bosilca wrote:

>
> On Dec 7, 2006, at 2:45 PM, Brock Palen wrote:
>
>>>>>
>>>>> $ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca
>>>>> btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm
>>>>>
>>>>> and HPL passes. The problem seems to be in the RDMA fragmenting
>>>>> code
>>>>> on OSX. The boundary values at the edges of the fragments are not
>>>>> correct.
>
> Here it looks like the OB1 PML was used. In order to get HPL to
> complete successfully we need to set btl_gm_min_rdma_size to
> 10MB. What I suspect is that 10MB is larger than any message HPL
> exchanges, so adding this MCA parameter effectively disables the
> RDMA protocol for GM.
>
> This seems to pinpoint a more complex problem which might not be
> related to the PML. If both PMLs (OB1 and DR) have a similar problem
> when running on top of the GM BTL, it might indicate the problem is
> down in the GM BTL. Can you confirm that HPL fails when running with
> OB1 and GM on this particular cluster?

Without modifying btl_gm_min_rdma_size, the run fails with bad
results when using OB1.
If btl_gm_min_rdma_size is modified (which, as you pointed out,
basically disables RDMA), it no longer fails.
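
For reference, the failing OB1 run is essentially the command George
quoted above minus the btl_gm_min_rdma_size override; the exact
invocation below is my reconstruction (same hosts and binary):

$ mpirun -np 4 -machinefile hosts -mca btl ^tcp ./hpcc.ompi.gm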

Using DR over ethernet (--mca btl ^gm) or over GM (with and without
btl_gm_min_rdma_size modified) does not even start up (nothing on
stdout or stderr, and it never exits).
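
For completeness, the DR runs were launched roughly like this (the
"pml dr" selection parameter is my assumption for how DR was enabled;
same hosts and binary as above). Both hang with no output:

$ mpirun -np 4 -machinefile hosts -mca pml dr -mca btl ^gm ./hpcc.ompi.gm
$ mpirun -np 4 -machinefile hosts -mca pml dr -mca btl ^tcp ./hpcc.ompi.gm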

Yes, there is a problem at the BTL level. But because the DR problem
is different and persists across both GM and TCP, I believe we are
looking at two separate issues. But I am not the person to make that
call.

Brock

>
> Thanks,
> george.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>