George, I appreciate the thought you have put into these comments. Perhaps I can try and make some of the changes as you suggest, but as you noted, sometime in the future. In theory, I agree with them, but in practice they are not so easy to implement. I have added a few more comments below. Also note, I am hoping to move this change into Open MPI 1.7.4 and have you as a reviewer for the ticket.
>From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George
>Sent: Wednesday, August 28, 2013 5:12 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>I'm not arguing against the choices you made. I just want to pinpoint that
>there are other way of achieving a similar goal with less impact on the
>components outside of the SMCUDA BTL, an approach which might be
>interesting in the long run. Thus my answer below is just so there is a trace of
>a possible starting point toward a less intrusive approach, in case someone is
>interested in implementing in the future (an intern or some PhD student).
>On Aug 23, 2013, at 21:53 , Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>> Yes, I agree that the CUDA support is more intrusive and ends up in
>different areas. The problem is that the changes could not be simply isolated
>in a BTL.
>> 1. To support the direct movement of GPU buffers, we often utilize copying
>>into host memory and then out of host memory. These copies have to be
>>done utilizing the cuMemcpy() functions rather than the memcpy() function.
>>This is why some changes ended up in the opal datatype area. Must of that
>>copying is driven by the convertor.
>Each proc has a master_convertor which is a specialized class of convertors
>used for the target processor(s), and this specialized convertor is cloned for
>every new request. Some of these convertors support heterogeneous
>operations, some of them checksums and some of them just memcpy. The
>CUDA could have become just another class of convertors, with their
>definition visible only in the SMCUDA component.
This would be nice to do. I just do not know how to implement. Note that this is not used only within smcuda BTL. It is also used in openib BTL and any other BTL if they end up supporting GPU buffers.
>> 2. I added support for doing an asynchronous copy into and out of host
>>buffers. This ended up touching datatype, PML, and BTL layers.
>I'm not sure I understand the issue here, but what you do in the BTL is none of
>the PML business as long as you respect the rules. For message logging we
>have to copy the data locally in addition to sending it over the network, and
>we do not sequentialize these two steps, the can progress simultaneously
>(supported by extra threads eventually).
This touched the datatype layer because I added a field into the convertor (a stream). I would have to think more about this one.
>> 3. A GPU buffer may utilize a different protocol than a HOST buffer within a
>BTL. This required me to find different ways to direct which PML protocol to
>use. In addition, it is assume that a BTL either supports RDMA or not. There is
>no idea of supporting based on the type of buffer one is sending.
>The change where the PML is using the flags on the endpoint instead of the
>one on the BTL is a good change, allowing far more flexibility on the handling
>of the transfer protocols.
>Based on my understudying of your code, you changed the AM tag used by
>the PML to send the message something that could have been done easily in
>the BTL as an example by providing different path for the PUT/GET operations
>based on the memory ownership. Basically I could not figure out why the
>decision of using specialized RDMA (IPC based) should be reported upstream
>to the PML instead of being a totally local decision on the BTL and exposed to
>the PML as a single PUT/GET interface.
I am using a new active message tag that is local to the smcuda for the smcuda BTLs to decide if they can support RDMA of GPU memory. When they agree that they can, I then piggy backed on the error handler callback support from the PML. I created a new flag so that when we call the error handler with that flag, the PML realizes it wants to adjust to add RDMA of GPU buffers, and adjusts the flag on the endpoint. And for this case, this can happen sometime after MPI_Init(), so that is why we call into the PML layer sometime later. For all other cases, the decision to support large message RDMA is determined at MPI_Init() time. And we need to report it upstream because the that is how the PML determines what protocol it will instruct the BTL to do.
>> Therefore, to leverage much of the existing datatype and PML support, I
>had to make changes in there. Overall, I agree it is not ideal, but the best I
>could come up with.
>>> -----Original Message-----
>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George
>>> Sent: Friday, August 23, 2013 7:36 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>>> trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>>> On Aug 22, 2013, at 19:24 , Rolf vandeVaart <rvandevaart_at_[hidden]>
>>>> Hi George:
>>>> The reason it tainted the PML is because the CUDA IPC support makes
>>> of the large message RDMA protocol of the PML layer. The smcuda btl
>>> starts up, but does not initially support any large message RDMA
>>> (RGET,RPUT) protocols. Then when a GPU buffer is first accessed, the
>>> smcuda btl starts an exchange of some control messages with its peer.
>>> If they determine that they can support CUDA IPC, then the smcuda
>>> calls up into the PML layer and says it is OK to start using the
>>> large message RDMA. This all happens in code that is only compiled in if
>the user asks for CUDA-aware support.
>>> The issue is not that is compiled in only when CUDA support is
>>> enabled. The problem is that it is a breakage of the separation between
>PML and BTL.
>>> Almost all BTL did manage to implement highly optimized protocols
>>> under this design without having to taint the PML. Very similarly to
>>> CUDA I can cite CMA and KNEM support in the SM BTL. So I really
>>> wonder why the CUDA support is so different, at the point where it
>>> had to go all over the place (convertor, memory pool and PML)?
>>>> The key requirement was I wanted to dynamically add the support for
>>> IPC when the user first started accessing GPU buffers rather than
>>> during MPI_Init.
>>> Moving from BTL flags to endpoint based flags is indeed a good thing.
>>> This is something that should be done everywhere in the PML code, as
>>> it will allow the BTL to support different behaviors based on the peer.
>>>> This the best way I could figure out how to accomplish this but I am
>>>> open to
>>> other ideas.
>>>>> -----Original Message-----
>>>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of
>>>>> Sent: Thursday, August 22, 2013 11:32 AM
>>>>> To: devel_at_[hidden]
>>>>> Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>>>>> trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>>>>> I'm not very keen of seeing BTL modification tainting the PML. I
>>>>> would have expected support for IPC between GPU must be a BTL-level
>>>>> decision, no a special path in the PML.
>>>>> Is there a reason IPC support cannot be hidden down in the SMCUDA
>>>>> On Aug 21, 2013, at 23:00 , svn-commit-mailer_at_[hidden] wrote:
>>>>>> Author: rolfv (Rolf Vandevaart)
>>>>>> Date: 2013-08-21 17:00:09 EDT (Wed, 21 Aug 2013) New Revision:
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29055
>>>>>> Fix support in smcuda btl so it does not blow up when there is no
>>>>>> CUDA IPC
>>>>> support between two GPUs. Also make it so CUDA IPC support is added
>>>>>> Fixes ticket 3531.
>>>>>> Text files modified:
>>>>>> trunk/ompi/mca/btl/btl.h | 2
>>>>>> trunk/ompi/mca/btl/smcuda/README | 113
>>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda.c | 104
>>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda.h | 28 +++++
>>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda_component.c | 200
>>>>>> trunk/ompi/mca/btl/smcuda/btl_smcuda_endpoint.h | 5 +
>>>>>> trunk/ompi/mca/common/cuda/common_cuda.c | 29 +++++
>>>>>> trunk/ompi/mca/common/cuda/common_cuda.h | 3
>>>>>> trunk/ompi/mca/pml/ob1/pml_ob1.c | 11 ++
>>>>>> trunk/ompi/mca/pml/ob1/pml_ob1_cuda.c | 42 ++++++++
>>>>>> trunk/ompi/mca/pml/ob1/pml_ob1_recvreq.c | 6
>>>>>> 11 files changed, 535 insertions(+), 8 deletions(-)
>>>>> devel mailing list
>>>> ------------- This email message is for the sole use of the intended
>>>> recipient(s) and may contain confidential information. Any
>>>> unauthorized review, use, disclosure or distribution is prohibited.
>>>> If you are not the intended recipient, please contact the sender by
>>>> reply email and destroy all copies of the original message.
>>>> ------------- _______________________________________________
>>>> devel mailing list
>>> devel mailing list
>> devel mailing list
>devel mailing list