Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
From: Tang, Changqing (changquing.tang_at_[hidden])
Date: 2007-12-20 11:24:09


Jack:
        Thanks for adding this new function; this is what we need.
There is one issue I want to make clear.

This new "kernel" owned QP "will be destroyed when the XRC domain is
closed
(i.e., as part of a ibv_close_xrc_domain call, but only when the
domain's reference count goes to zero) "

        If I have an MPI server process on a node, many other MPI
client processes will dynamically connect to and disconnect from the
server. The server uses the same XRC domain the whole time.

        Will this cause the "kernel" QPs to accumulate for such an
application? We want the server to run 365 days a year.
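
        To make the concern concrete, here is a minimal sketch of
such a server loop. ibv_alloc_xrc_rcv_qp is from your proposal below;
accept_client(), send_qpn_to_client(), and client_gone() are
hypothetical stand-ins for our connection management:

    #include <stdint.h>
    #include <infiniband/verbs.h>  /* OFED build with the XRC extensions */

    /* Hypothetical connection-management helpers, not part of any API. */
    void *accept_client(void);
    void  send_qpn_to_client(void *client, uint32_t qpn);
    void  client_gone(void *client);

    /* Sketch: one kernel-owned XRC receive QP is allocated per
     * connecting client, while a single XRC domain stays open for the
     * life of the server. */
    void run_server(struct ibv_pd *pd, struct ibv_xrc_domain *domain,
                    struct ibv_qp_attr *attr, enum ibv_qp_attr_mask mask)
    {
        for (;;) {
            void *client = accept_client();

            uint32_t rcv_qpn;
            if (ibv_alloc_xrc_rcv_qp(pd, domain, attr, mask, &rcv_qpn))
                break;                           /* allocation failed */
            send_qpn_to_client(client, rcv_qpn);

            /* ... the client runs for a while, then disconnects ... */
            client_gone(client);

            /* There is no destroy verb for rcv_qpn: the kernel QP lives
             * until the domain's reference count reaches zero, which
             * never happens while the server holds the domain open. */
        }
    }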

Thanks.
--CQ

> -----Original Message-----
> From: Pavel Shamis (Pasha) [mailto:pasha_at_[hidden]]
> Sent: Thursday, December 20, 2007 9:15 AM
> To: Jack Morgenstein
> Cc: Tang, Changqing; Roland Dreier;
> general_at_[hidden]; Open MPI Developers;
> mvapich-discuss_at_[hidden]
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> Adding the Open MPI and MVAPICH communities to the thread.
>
> Pasha (Pavel Shamis)
>
> Jack Morgenstein wrote:
>> Background: see the "XRC Cleanup order issue" thread at
>> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
>>
>> (A userspace process which created the receiving XRC QP on a given
>> host dies before other processes which still need to receive XRC
>> messages on their SRQs, which are "paired" with the now-destroyed
>> receiving XRC QP.)
>>
>> Solution: Add a userspace verb (as part of the XRC suite) which
>> enables the user process to create an XRC QP owned by the kernel --
>> one which belongs to the required XRC domain.
>>
>> This QP will be destroyed when the XRC domain is closed (i.e., as
>> part of an ibv_close_xrc_domain call, but only when the domain's
>> reference count goes to zero).
>>
>> Below, I give the new userspace API for this function. Any feedback
>> will be appreciated. This API will be implemented in the upcoming
>> OFED 1.3 release, so we need feedback ASAP.
>>
>> Notes:
>> 1. There is no query or destroy verb for this QP. There is also no
>>    userspace object for the QP. Userspace has ONLY the raw QP number
>>    to use when creating the (X)RC connection (see the sketch after
>>    these notes).
>>
>> 2. Since the QP is "owned" by kernel space, async events for this QP
>>    are also handled in kernel space (i.e., reported in
>>    /var/log/messages). There are no completion events for the QP,
>>    since it does not send, and all receive completions are reported
>>    in the XRC SRQ's CQ.
>>
>>    If this QP enters the error state, the remote QP which sends will
>>    start receiving RETRY_EXCEEDED errors, so the application will be
>>    aware of the failure.
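
Regarding the raw QP number in note 1, for concreteness: the remote
(sending) side would connect to it with an ordinary modify-QP call.
A sketch using the standard verbs API; the path, MTU, PSN, and timer
values are illustrative only:

    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch: bring a local sending QP (assumed already in INIT) to
     * RTR, with the kernel-owned receive QP on the remote host as its
     * destination. Only the destination plumbing is shown. */
    int connect_to_xrc_rcv_qp(struct ibv_qp *qp, uint32_t xrc_rcv_qpn,
                              uint16_t dlid, uint8_t port)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));

        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_2048;     /* illustrative */
        attr.dest_qp_num        = xrc_rcv_qpn; /* raw QPN from the peer */
        attr.rq_psn             = 0;
        attr.min_rnr_timer      = 12;               /* illustrative */
        attr.max_dest_rd_atomic = 1;
        attr.ah_attr.dlid       = dlid;
        attr.ah_attr.port_num   = port;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MIN_RNR_TIMER |
                             IBV_QP_MAX_DEST_RD_ATOMIC);
    }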
>>
>> - Jack
>>
>> ======================================================================
>> /**
>>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a
>>  * receive-side-only QP, and moves the created QP through the
>>  * RESET->INIT and INIT->RTR transitions. (The RTR->RTS transition
>>  * is not needed, since this QP does no sending.) The sending XRC QP
>>  * uses this QP as its destination, while specifying an XRC SRQ for
>>  * actually receiving the transmissions and generating all
>>  * completions on the receiving side.
>>  *
>>  * This QP is created in kernel space, and persists until the XRC
>>  * domain is closed (i.e., its reference count goes to zero).
>>  *
>>  * @pd: protection domain to use. At the lower layer, this provides
>>  *      access to the userspace object.
>>  * @xrc_domain: XRC domain to use for the QP.
>>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>>  * @attr_mask: bitmap indicating which attributes are provided in
>>  *      the attr struct; used for validity checking.
>>  * @xrc_rcv_qpn: qp_num of the created QP (on success). To be passed
>>  *      to the remote node. The remote node will use xrc_rcv_qpn in
>>  *      ibv_post_send when sending to XRC SRQs on this host in the
>>  *      same XRC domain.
>>  *
>>  * RETURNS: success (0), or a (negative) error value.
>>  */
>>
>> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
>>                          struct ibv_xrc_domain *xrc_domain,
>>                          struct ibv_qp_attr *attr,
>>                          enum ibv_qp_attr_mask attr_mask,
>>                          uint32_t *xrc_rcv_qpn);
>>
>> Notes:
>>
>> 1. Although the kernel creates the QP in the kernel's own PD, we
>>    still need the PD parameter to determine the device.
>>
>> 2. I chose to use struct ibv_qp_attr, which is used in modify QP,
>>    rather than create a new structure for this purpose. This also
>>    guards against API changes in the event that during development
>>    I notice that more modify-qp parameters must be specified for
>>    this operation to work.
>>
>> 3. Table of the ibv_qp_attr parameters showing what values to set:
>>
>>    struct ibv_qp_attr {
>>        enum ibv_qp_state   qp_state;            Not needed
>>        enum ibv_qp_state   cur_qp_state;        Not needed
>>                                                 -- driver starts from RESET
>>                                                 and takes the QP to RTR.
>>        enum ibv_mtu        path_mtu;            Yes
>>        enum ibv_mig_state  path_mig_state;      Yes
>>        uint32_t            qkey;                Yes
>>        uint32_t            rq_psn;              Yes
>>        uint32_t            sq_psn;              Not needed
>>        uint32_t            dest_qp_num;         Yes -- the remote side's QP
>>                                                 for the RC connection.
>>        int                 qp_access_flags;     Yes
>>        struct ibv_qp_cap   cap;                 Need only the XRC domain.
>>                                                 Other caps will use
>>                                                 hard-coded values:
>>                                                     max_send_wr     = 1;
>>                                                     max_recv_wr     = 0;
>>                                                     max_send_sge    = 1;
>>                                                     max_recv_sge    = 0;
>>                                                     max_inline_data = 0;
>>        struct ibv_ah_attr  ah_attr;             Yes
>>        struct ibv_ah_attr  alt_ah_attr;         Optional
>>        uint16_t            pkey_index;          Yes
>>        uint16_t            alt_pkey_index;      Optional
>>        uint8_t             en_sqd_async_notify; Not needed (no SQ)
>>        uint8_t             sq_draining;         Not needed (no SQ)
>>        uint8_t             max_rd_atomic;       Not needed (no SQ)
>>        uint8_t             max_dest_rd_atomic;  Yes -- total max outstanding
>>                                                 RDMAs expected for ALL SRQ
>>                                                 destinations using this
>>                                                 receive QP. (If you are only
>>                                                 using SENDs, this can be 0.)
>>        uint8_t             min_rnr_timer;       Default: 0
>>        uint8_t             port_num;            Yes
>>        uint8_t             timeout;             Yes
>>        uint8_t             retry_cnt;           Yes
>>        uint8_t             rnr_retry;           Yes
>>        uint8_t             alt_port_num;        Optional
>>        uint8_t             alt_timeout;         Optional
>>    };
>>
>> 4. Attribute mask bits to set:
>>
>>    For the RESET->INIT transition:
>>        IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
>>
>>    For the INIT->RTR transition:
>>        IB_QP_AV | IB_QP_PATH_MTU | IB_QP_DEST_QPN |
>>        IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
>>
>>    If you are using RDMA or atomics, also set:
>>        IB_QP_MAX_DEST_RD_ATOMIC
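
Putting items 3 and 4 above together, a receive-side caller might look
like the sketch below. ibv_alloc_xrc_rcv_qp and struct ibv_xrc_domain
are from the proposal; the IBV_QP_* spellings are my assumption for
the userspace equivalents of the kernel-style IB_QP_* names listed in
item 4, and the access flags, MTU, and addressing values are
illustrative only:

    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>  /* OFED build with the XRC extensions */

    /* Sketch: allocate a kernel-owned XRC receive QP, filling the
     * attr fields required by the mask bits in item 4 (SEND-only, so
     * IBV_QP_MAX_DEST_RD_ATOMIC is omitted). */
    int make_xrc_rcv_qp(struct ibv_pd *pd, struct ibv_xrc_domain *domain,
                        uint32_t remote_send_qpn, uint16_t remote_lid,
                        uint8_t port, uint32_t *xrc_rcv_qpn)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));

        /* RESET->INIT attributes */
        attr.qp_access_flags = 0;       /* SEND-only; no RDMA/atomics */
        attr.pkey_index      = 0;
        attr.port_num        = port;

        /* INIT->RTR attributes */
        attr.path_mtu         = IBV_MTU_2048;        /* illustrative */
        attr.dest_qp_num      = remote_send_qpn; /* remote sending QP */
        attr.rq_psn           = 0;
        attr.min_rnr_timer    = 0;                   /* default */
        attr.ah_attr.dlid     = remote_lid;
        attr.ah_attr.port_num = port;

        enum ibv_qp_attr_mask mask =
            IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX | IBV_QP_PORT | /* RESET->INIT */
            IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |         /* INIT->RTR   */
            IBV_QP_RQ_PSN | IBV_QP_MIN_RNR_TIMER;

        return ibv_alloc_xrc_rcv_qp(pd, domain, &attr, mask, xrc_rcv_qpn);
    }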
>
>
> --
> Pavel Shamis (Pasha)
> Mellanox Technologies
>
>
_______________________________________________
general mailing list
general_at_[hidden]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general