Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] heterogeneous OpenFabrics adapters
From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2008-05-13 05:46:15


Jeff,
Your proposal for 1.3 sounds OK to me.

For 1.4 we need to review this point again. The qp information is split
across three different structs:
mca_btl_openib_module_qp_t (used by the module), mca_btl_openib_qp_t (used
by the component), and mca_btl_openib_endpoint_qp_t (used by the endpoint).
We need to see how we will resolve the issue for each of them. Let's put
it on the 1.4 todo list.
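
Just to illustrate one possible direction for 1.4 (only a sketch, the
struct name below is made up, not a patch): the per-HCA struct could own
the qp descriptions, and the module/endpoint state would be sized and
initialized from it:

    /* Hypothetical sketch: hang the BSRQ/qp descriptions off the HCA
     * instead of the component, so different HCAs could use different
     * receive_queues values. */
    typedef struct {
        int num_qps;                             /* number of BSRQ qps   */
        mca_btl_openib_qp_t *qps;                /* component-level info */
        mca_btl_openib_module_qp_t *module_qps;  /* module-level state   */
        /* endpoint-level state (mca_btl_openib_endpoint_qp_t) would stay
         * per endpoint, but be allocated/initialized from here. */
    } hca_qps_sketch_t;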

Pasha.

Jeff Squyres wrote:
> Short version:
> --------------
>
> I propose that we should disallow multiple different
> mca_btl_openib_receive_queues values (or receive_queues values from
> the INI file) to be used in a single MPI job for the v1.3 series.
>
> More details:
> -------------
>
> The reason I'm looking into this heterogeneity stuff is to help
> Chelsio support their iWARP NIC in OMPI. Their NIC needs a specific
> value for mca_btl_openib_receive_queues (specifically: it does not
> support SRQ and it has the wireup race condition that we discussed
> before).
>
> The major problem is that all the BSRQ information is currently stored
> on the openib component -- it is *not* maintained on a per-HCA (or
> per port) basis. We *could* move all the BSRQ info to live on the
> hca_t struct (or even the openib module struct), but it has at least 3
> big consequences:
>
> 1. It would touch a lot of code. But touching all this code is
> relatively low risk; it will be easy to check for correctness because
> the changes will either compile or not.
>
> 2. There are functions (some of which are static inline) that read the
> BSRQ data. These functions would have to take an additional (hca_t*)
> (or (btl_openib_module_t*)) parameter.
>
> 3. Getting to the BSRQ info will take at least 1 or 2 more
> dereferences (e.g., module->hca->bsrq_info or module->bsrq_info...).
>
> I'm not too concerned about #1 (it's grunt work), but I am a bit
> concerned about #2 and #3 since at least some of these places are in
> the critical performance path.
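
To make #3 concrete, the difference is roughly the following (field
names are only illustrative, from my reading of the current code, not
an actual diff):

    /* today: the BSRQ info is global on the component */
    size = mca_btl_openib_component.qp_infos[qp].size;

    /* after the move: one or two extra pointer hops per access */
    size = module->hca->qp_infos[qp].size;

    /* on the critical path the extra hops can mostly be hidden by
     * hoisting the pointer once per function or loop: */
    qp_infos = module->hca->qp_infos;
    size = qp_infos[qp].size;

Whether the compiler keeps that hoisted pointer in a register is
something we would have to verify on the fast path.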
>
> Given these concerns, I propose the following for v1.3:
>
> - Add a "receive_queues" field to the INI file so that the Chelsio
> adapter can run out of the box (i.e., "mpirun -np 4 a.out" with hosts
> containing Chelsio NICs will get a value for btl_openib_receive_queues
> that will work).
>
> - NetEffect NICs will also require overriding
> btl_openib_receive_queues, but will likely have a different value than
> Chelsio NICs (they don't have the wireup race condition that Chelsio
> does).
>
> - Because the BSRQ info is on the component (i.e., global), we should
> detect when multiple different receive_queues values are specified and
> gracefully abort.
>
> I think it'll be quite uncommon to have a need for two different
> receive_queues values, and that this proposal will be fine for v1.3.
>
> Comments?
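
Regarding the receive_queues INI field: for an adapter without SRQ
support the value would simply be a per-peer-only (P) specification.
Something along these lines (the section name, vendor id, and numbers
are only illustrative, not the real Chelsio parameters):

    [Chelsio T3]
    vendor_id = 0x1425
    receive_queues = P,65536,256,192,128

which should behave the same as an explicit override on the command
line:

    mpirun -np 4 --mca btl_openib_receive_queues P,65536,256,192,128 a.out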
>
>
>
> On May 12, 2008, at 6:44 PM, Jeff Squyres wrote:
>
>
>> After looking at the code a bit, I realized that I completely forgot
>> that the INI file was invented to solve at least the heterogeneous-
>> adapters-in-a-host problem.
>>
>> So I amended the ticket to reflect that that problem is already
>> solved. The other part is not, though -- consider two MPI procs on
>> different hosts, each with an iWARP NIC, but one NIC supports SRQs and
>> one does not.
>>
>>
>> On May 12, 2008, at 5:36 PM, Jeff Squyres wrote:
>>
>>
>>> I think that this issue has come up before, but I filed a ticket
>>> about it because at least one developer (Jon) has a system with both
>>> IB and iWARP adapters:
>>>
>>> https://svn.open-mpi.org/trac/ompi/ticket/1282
>>>
>>> My question: do we care about the heterogeneous adapter scenarios?
>>> For v1.3? For v1.4? For ...some version in the future?
>>>
>>> I think the first issue I identified in the ticket is grunt work to
>>> fix (annoying and tedious, but not difficult), but the second one
>>> will be a little dicey -- it has scalability issues (e.g., sending
>>> around all info in the modex, etc.).
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
>

-- 
Pavel Shamis (Pasha)
Mellanox Technologies