Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2007-12-20 10:14:48


Adding the Open MPI and MVAPICH communities to the thread.

Pasha (Pavel Shamis)

Jack Morgenstein wrote:
> Background: see the "XRC Cleanup order issue" thread at
>
> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
>
> (The userspace process that created the receiving XRC QP on a given host dies before
> other processes that still need to receive XRC messages on their SRQs, which are
> "paired" with the now-destroyed receiving XRC QP.)
>
> Solution: add a userspace verb (as part of the XRC suite) that enables a user process
> to create an XRC QP owned by the kernel, belonging to the required XRC domain.
>
> This QP will be destroyed when the XRC domain is closed (i.e., as part of an ibv_close_xrc_domain
> call, but only when the domain's reference count goes to zero).
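>
> As a lifecycle sketch (using the new verb given below, together with the
> ibv_open_xrc_domain/ibv_close_xrc_domain verbs from the XRC suite; error
> handling and the out-of-band qp-number exchange are omitted):
>
>     /* needs <infiniband/verbs.h> and <fcntl.h> (for O_CREAT) */
>     void xrc_rcv_qp_lifetime(struct ibv_context *ctx, struct ibv_pd *pd,
>                              int fd, struct ibv_qp_attr *attr,
>                              enum ibv_qp_attr_mask attr_mask)
>     {
>             struct ibv_xrc_domain *dom;
>             uint32_t rcv_qpn;
>
>             dom = ibv_open_xrc_domain(ctx, fd, O_CREAT);
>             ibv_alloc_xrc_rcv_qp(pd, dom, attr, attr_mask, &rcv_qpn);
>             /* hand rcv_qpn to the remote sender(s); this process may now exit.
>              * The kernel-owned QP persists until the last user of the domain
>              * calls ibv_close_xrc_domain and the reference count hits zero. */
>             ibv_close_xrc_domain(dom);
>     }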
>
> Below, I give the new userspace API for this function. Any feedback will be appreciated.
> This API will be implemented in the upcoming OFED 1.3 release, so we need feedback ASAP.
>
> Notes:
> 1. There is no query or destroy verb for this QP. There is also no userspace object for the
> QP. Userspace has ONLY the raw qp number to use when creating the (X)RC connection.
>
> 2. Since the QP is "owned" by kernel space, async events for this QP are also handled in kernel
> space (i.e., reported in /var/log/messages). There are no completion events for the QP, since
> it does not send, and all receive completions are reported in the XRC SRQ's CQ.
>
> If this QP enters the error state, the remote sending QP will start receiving RETRY_EXCEEDED
> errors, so the application will be aware of the failure.
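>
> A sender could notice this in its normal completion handling; a minimal sketch
> (send_cq is assumed to be the sending QP's completion queue):
>
>     struct ibv_wc wc;
>
>     while (ibv_poll_cq(send_cq, 1, &wc) > 0) {
>             if (wc.status == IBV_WC_RETRY_EXC_ERR) {
>                     /* the remote rcv QP is in error (or gone) --
>                      * tear down or re-establish the connection */
>             }
>     }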
>
> - Jack
> ======================================================================================
> /**
>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP to serve as a receive-side-only QP,
>  * and moves the created QP through the RESET->INIT and INIT->RTR transitions.
>  * (The RTR->RTS transition is not needed, since this QP does no sending.)
>  * The sending XRC QP uses this QP as its destination, while specifying an XRC SRQ
>  * for actually receiving the transmissions and generating all completions on the
>  * receiving side.
>  *
>  * This QP is created in kernel space, and persists until the XRC domain is closed
>  * (i.e., its reference count goes to zero).
>  *
>  * @pd: protection domain to use. At the lower layer, this provides access to the
>  *      userspace object.
>  * @xrc_domain: XRC domain to use for the QP.
>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>  * @attr_mask: bitmap indicating which attributes are provided in the attr struct;
>  *      used for validity checking.
>  * @xrc_rcv_qpn: qp_num of the created QP (on success). To be passed to the remote
>  *      node; the remote node will use xrc_rcv_qpn in ibv_post_send when sending to
>  *      XRC SRQs on this host in the same XRC domain.
>  *
>  * RETURNS: success (0), or a (negative) error value.
>  */
>
> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
>                          struct ibv_xrc_domain *xrc_domain,
>                          struct ibv_qp_attr *attr,
>                          enum ibv_qp_attr_mask attr_mask,
>                          uint32_t *xrc_rcv_qpn);
>
> Notes:
>
> 1. Although the kernel creates the qp in the kernel's own PD, we still need the PD
> parameter to determine the device.
>
> 2. I chose to use struct ibv_qp_attr, which is used in modify QP, rather than create
> a new structure for this purpose. This also guards against API changes in the event
> that during development I notice that more modify-qp parameters must be specified
> for this operation to work.
>
> 3. Table of the ibv_qp_attr parameters showing what values to set:
>
> struct ibv_qp_attr {
>     enum ibv_qp_state   qp_state;            Not needed
>     enum ibv_qp_state   cur_qp_state;        Not needed
>                                              -- Driver starts from RESET and takes qp to RTR.
>     enum ibv_mtu        path_mtu;            Yes
>     enum ibv_mig_state  path_mig_state;      Yes
>     uint32_t            qkey;                Yes
>     uint32_t            rq_psn;              Yes
>     uint32_t            sq_psn;              Not needed
>     uint32_t            dest_qp_num;         Yes -- this is the remote side QP for the RC conn.
>     int                 qp_access_flags;     Yes
>     struct ibv_qp_cap   cap;                 Need only XRC domain.
>                                              Other caps will use hard-coded values:
>                                                  max_send_wr     = 1;
>                                                  max_recv_wr     = 0;
>                                                  max_send_sge    = 1;
>                                                  max_recv_sge    = 0;
>                                                  max_inline_data = 0;
>     struct ibv_ah_attr  ah_attr;             Yes
>     struct ibv_ah_attr  alt_ah_attr;         Optional
>     uint16_t            pkey_index;          Yes
>     uint16_t            alt_pkey_index;      Optional
>     uint8_t             en_sqd_async_notify; Not needed (no SQ)
>     uint8_t             sq_draining;         Not needed (no SQ)
>     uint8_t             max_rd_atomic;       Not needed (no SQ)
>     uint8_t             max_dest_rd_atomic;  Yes -- total max outstanding RDMAs expected
>                                              for ALL SRQ destinations using this receive QP.
>                                              (If you are only using SENDs, this value can be 0.)
>     uint8_t             min_rnr_timer;       Default - 0
>     uint8_t             port_num;            Yes
>     uint8_t             timeout;             Yes
>     uint8_t             retry_cnt;           Yes
>     uint8_t             rnr_retry;           Yes
>     uint8_t             alt_port_num;        Optional
>     uint8_t             alt_timeout;         Optional
> };
>
> 4. Attribute mask bits to set (enum ibv_qp_attr_mask; a sketch of filling
>    these in follows below):
>
>    For the RESET-to-INIT transition:
>        IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX | IBV_QP_PORT
>
>    For the INIT-to-RTR transition:
>        IBV_QP_AV | IBV_QP_PATH_MTU |
>        IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MIN_RNR_TIMER
>
>    If you are using RDMA or atomics, also set:
>        IBV_QP_MAX_DEST_RD_ATOMIC
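>
> A minimal sketch of filling these in for the SEND-only case (so
> IBV_QP_MAX_DEST_RD_ATOMIC is omitted). The remote_* and port parameters stand
> in for values obtained through the usual out-of-band connection exchange, and
> the access flags and MTU are just examples:
>
>     /* needs <infiniband/verbs.h> and <string.h> */
>     int setup_xrc_rcv_qp(struct ibv_pd *pd, struct ibv_xrc_domain *dom,
>                          uint32_t remote_send_qpn, uint32_t remote_psn,
>                          uint16_t remote_lid, uint8_t port, uint32_t *rcv_qpn)
>     {
>             struct ibv_qp_attr attr;
>
>             memset(&attr, 0, sizeof attr);
>             /* RESET-to-INIT fields */
>             attr.qp_access_flags  = IBV_ACCESS_REMOTE_WRITE;
>             attr.pkey_index       = 0;
>             attr.port_num         = port;
>             /* INIT-to-RTR fields */
>             attr.path_mtu         = IBV_MTU_1024;
>             attr.dest_qp_num      = remote_send_qpn;  /* remote side's sending QP */
>             attr.rq_psn           = remote_psn;
>             attr.min_rnr_timer    = 0;
>             attr.ah_attr.dlid     = remote_lid;
>             attr.ah_attr.port_num = port;
>
>             return ibv_alloc_xrc_rcv_qp(pd, dom, &attr,
>                                         IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX |
>                                         IBV_QP_PORT | IBV_QP_AV | IBV_QP_PATH_MTU |
>                                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
>                                         IBV_QP_MIN_RNR_TIMER,
>                                         rcv_qpn);
>     }
>
> On success, *rcv_qpn is what gets shipped to the remote node for its side of
> the connection setup.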
>
>

-- 
Pavel Shamis (Pasha)
Mellanox Technologies