I'm not arguing against the choices you made. I just want to pinpoint that there are other way of achieving a similar goal with less impact on the components outside of the SMCUDA BTL, an approach which might be interesting in the long run. Thus my answer below is just so there is a trace of a possible starting point toward a less intrusive approach, in case someone is interested in implementing in the future (an intern or some PhD student).
On Aug 23, 2013, at 21:53 , Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
> Yes, I agree that the CUDA support is more intrusive and ends up in different areas. The problem is that the changes could not be simply isolated in a BTL.
> 1. To support the direct movement of GPU buffers, we often utilize copying into host memory and then out of host memory. These copies have to be done utilizing the cuMemcpy() functions rather than the memcpy() function. This is why some changes ended up in the opal datatype area. Must of that copying is driven by the convertor.
Each proc has a master_convertor which is a specialized class of convertors used for the target processor(s), and this specialized convertor is cloned for every new request. Some of these convertors support heterogeneous operations, some of them checksums and some of them just memcpy. The CUDA could have become just another class of convertors, with their definition visible only in the SMCUDA component.
> 2. I added support for doing an asynchronous copy into and out of host buffers. This ended up touching datatype, PML, and BTL layers.
I'm not sure I understand the issue here, but what you do in the BTL is none of the PML business as long as you respect the rules. For message logging we have to copy the data locally in addition to sending it over the network, and we do not sequentialize these two steps, the can progress simultaneously (supported by extra threads eventually).
> 3. A GPU buffer may utilize a different protocol than a HOST buffer within a BTL. This required me to find different ways to direct which PML protocol to use. In addition, it is assume that a BTL either supports RDMA or not. There is no idea of supporting based on the type of buffer one is sending.
The change where the PML is using the flags on the endpoint instead of the one on the BTL is a good change, allowing far more flexibility on the handling of the transfer protocols.
Based on my understudying of your code, you changed the AM tag used by the PML to send the message something that could have been done easily in the BTL as an example by providing different path for the PUT/GET operations based on the memory ownership. Basically I could not figure out why the decision of using specialized RDMA (IPC based) should be reported upstream to the PML instead of being a totally local decision on the BTL and exposed to the PML as a single PUT/GET interface.
> Therefore, to leverage much of the existing datatype and PML support, I had to make changes in there. Overall, I agree it is not ideal, but the best I could come up with.
>> -----Original Message-----
>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George
>> Sent: Friday, August 23, 2013 7:36 AM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>> trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>> On Aug 22, 2013, at 19:24 , Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>>> Hi George:
>>> The reason it tainted the PML is because the CUDA IPC support makes use
>> of the large message RDMA protocol of the PML layer. The smcuda btl starts
>> up, but does not initially support any large message RDMA (RGET,RPUT)
>> protocols. Then when a GPU buffer is first accessed, the smcuda btl starts an
>> exchange of some control messages with its peer. If they determine that
>> they can support CUDA IPC, then the smcuda calls up into the PML layer and
>> says it is OK to start using the large message RDMA. This all happens in code
>> that is only compiled in if the user asks for CUDA-aware support.
>> The issue is not that is compiled in only when CUDA support is enabled. The
>> problem is that it is a breakage of the separation between PML and BTL.
>> Almost all BTL did manage to implement highly optimized protocols under this
>> design without having to taint the PML. Very similarly to CUDA I can cite CMA
>> and KNEM support in the SM BTL. So I really wonder why the CUDA support is
>> so different, at the point where it had to go all over the place (convertor,
>> memory pool and PML)?
>>> The key requirement was I wanted to dynamically add the support for CUDA
>> IPC when the user first started accessing GPU buffers rather than during
>> Moving from BTL flags to endpoint based flags is indeed a good thing. This is
>> something that should be done everywhere in the PML code, as it will allow
>> the BTL to support different behaviors based on the peer.
>>> This the best way I could figure out how to accomplish this but I am open to
>> other ideas.
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George
>>>> Sent: Thursday, August 22, 2013 11:32 AM
>>>> To: devel_at_[hidden]
>>>> Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>>>> trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>>>> I'm not very keen of seeing BTL modification tainting the PML. I
>>>> would have expected support for IPC between GPU must be a BTL-level
>>>> decision, no a special path in the PML.
>>>> Is there a reason IPC support cannot be hidden down in the SMCUDA BTL?
>>>> On Aug 21, 2013, at 23:00 , svn-commit-mailer_at_[hidden] wrote:
>>>>> Author: rolfv (Rolf Vandevaart)
>>>>> Date: 2013-08-21 17:00:09 EDT (Wed, 21 Aug 2013) New Revision: 29055
>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29055
>>>>> Fix support in smcuda btl so it does not blow up when there is no
>>>>> CUDA IPC
>>>> support between two GPUs. Also make it so CUDA IPC support is added
>>>>> Fixes ticket 3531.
>>>>> Text files modified:
>>>>> trunk/ompi/mca/btl/btl.h | 2
>>>>> trunk/ompi/mca/btl/smcuda/README | 113
>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda.c | 104
>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda.h | 28 +++++
>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda_component.c | 200
>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda_endpoint.h | 5 +
>>>>> trunk/ompi/mca/common/cuda/common_cuda.c | 29 +++++
>>>>> trunk/ompi/mca/common/cuda/common_cuda.h | 3
>>>>> trunk/ompi/mca/pml/ob1/pml_ob1.c | 11 ++
>>>>> trunk/ompi/mca/pml/ob1/pml_ob1_cuda.c | 42 ++++++++
>>>>> trunk/ompi/mca/pml/ob1/pml_ob1_recvreq.c | 6
>>>>> 11 files changed, 535 insertions(+), 8 deletions(-)
>>>> devel mailing list
>>> ------------- This email message is for the sole use of the intended
>>> recipient(s) and may contain confidential information. Any
>>> unauthorized review, use, disclosure or distribution is prohibited.
>>> If you are not the intended recipient, please contact the sender by
>>> reply email and destroy all copies of the original message.
>>> ------------- _______________________________________________
>>> devel mailing list
>> devel mailing list
> devel mailing list