On Dec 7, 2006, at 3:14 PM, George Bosilca wrote:
> On Dec 7, 2006, at 2:45 PM, Brock Palen wrote:
>>>>> $ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca
>>>>> btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm
>>>>> and HPL passes. The problem seems to be in the RDMA fragmenting
>>>>> on OSX. The boundary values at the edges of the fragments are not
> Here it looks like the OB1 PML was used. In order to get HPL to
> complete successfully we need to set btl_gm_min_rdma_size to
> 10MB. What I suspect is that 10MB is more than the size of any
> message HPL exchanges, so adding this MCA parameter effectively
> disables the RDMA protocol for GM.
> This seems to pinpoint a more complex problem which might not be
> related to the PML. If both PMLs (OB1 and DR) have a similar problem
> when running on top of the GM BTL it might indicate the problem is
> down in the GM BTL. Can you confirm that HPL fails when running
> with OB1 and GM on this particular cluster?
Without modifying btl_gm_min_rdma_size, the run fails with bad
results when using OB1.
If btl_gm_min_rdma_size is modified (which, as you pointed out,
basically disables RDMA), it no longer fails.
Using DR over Ethernet (--mca btl ^gm) or over GM (with and
without btl_gm_min_rdma_size modified) does not even start up
(nothing on stdout or stderr, and it never exits).
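For reference, the runs described above can be written out as a small
script. This is only a sketch: the thread does not show the exact command
lines for every case, so the "-mca pml ob1" / "-mca pml dr" selection and
the reuse of the ./hpcc.ompi.gm binary for the Ethernet run are
assumptions. Only the btl options, the host file arguments and the 10MB
value come from the command quoted earlier. The script prints the
commands instead of running them.

```shell
#!/bin/sh
# 10MB threshold, the same arithmetic as $((10*1024*1024)) in the quoted command.
MIN_RDMA=$((10 * 1024 * 1024))

# Common arguments taken from the command quoted earlier in this thread.
BASE="mpirun -np 4 -machinefile hosts"

# Assumption: the PML is selected explicitly with "-mca pml <name>".
# OB1 over GM, default RDMA threshold: reported to fail with bad results.
CMD_OB1_GM="$BASE -mca pml ob1 -mca btl ^tcp ./hpcc.ompi.gm"

# OB1 over GM, threshold raised to 10MB: reported to pass.
CMD_OB1_GM_NORDMA="$BASE -mca pml ob1 -mca btl ^tcp -mca btl_gm_min_rdma_size $MIN_RDMA ./hpcc.ompi.gm"

# DR over GM and over Ethernet: reported to hang with no output.
# Assumption: the same binary is used for the Ethernet run.
CMD_DR_GM="$BASE -mca pml dr -mca btl ^tcp ./hpcc.ompi.gm"
CMD_DR_ETH="$BASE -mca pml dr -mca btl ^gm ./hpcc.ompi.gm"

# Print the commands so they can be reviewed before being run by hand.
printf '%s\n' "$CMD_OB1_GM" "$CMD_OB1_GM_NORDMA" "$CMD_DR_GM" "$CMD_DR_ETH"
```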
Yes, there is a problem at the BTL level. But because the DR problem is
different and persists across both GM and TCP, I believe we are looking
at two separate issues. But I am not the person to make that call.