Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2007-12-20 10:14:48

Adding the Open MPI and MVAPICH communities to the thread.

Pasha (Pavel Shamis)

Jack Morgenstein wrote:
> background: see "XRC Cleanup order issue thread" at
> (The userspace process that created the receiving XRC QP on a given host dies before
> other processes that still need to receive XRC messages on their SRQs, which are
> "paired" with the now-destroyed receiving XRC QP.)
> Solution: Add a userspace verb (as part of the XRC suite) which enables the user process
> to create an XRC QP owned by the kernel -- which belongs to the required XRC domain.
> This QP will be destroyed when the XRC domain is closed (i.e., as part of an ibv_close_xrc_domain
> call, but only when the domain's reference count goes to zero).
> Below, I give the new userspace API for this function. Any feedback will be appreciated.
> This API will be implemented in the upcoming OFED 1.3 release, so we need feedback ASAP.
> Notes:
> 1. There is no query or destroy verb for this QP. There is also no userspace object for the
> QP. Userspace has ONLY the raw qp number to use when creating the (X)RC connection.
> 2. Since the QP is "owned" by kernel space, async events for this QP are also handled in kernel
> space (i.e., reported in /var/log/messages). There are no completion events for the QP, since
> it does not send, and all receive completions are reported in the XRC SRQ's CQ.
> If this QP enters the error state, the remote QP which sends will start receiving RETRY_EXCEEDED
> errors, so the application will be aware of the failure.
> - Jack
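
To make the proposed lifetime rule concrete, here is a minimal sketch of the reference-count behaviour described above: the receive QP is attached to the XRC domain rather than to the process that asked for it, and it goes away only when the last holder closes the domain. This is a toy userspace model, not the OFED implementation; the struct and the model_* function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an XRC domain; not the OFED data structure. */
struct xrc_domain_model {
    unsigned refcount;     /* processes that currently have the domain open */
    bool     rcv_qp_alive; /* true while the kernel-owned receive QP exists */
    uint32_t rcv_qpn;      /* raw QP number, the only handle userspace gets */
};

/* A process opens the domain: take one reference. */
static void model_open(struct xrc_domain_model *d)
{
    d->refcount++;
}

/* Models the allocation call: the QP belongs to the domain, not the caller. */
static uint32_t model_alloc_rcv_qp(struct xrc_domain_model *d, uint32_t qpn)
{
    assert(d->refcount > 0); /* caller must have the domain open */
    d->rcv_qp_alive = true;
    d->rcv_qpn = qpn;
    return d->rcv_qpn;
}

/* A process closes the domain (or dies): drop one reference.
 * The receive QP is destroyed only when the count reaches zero. */
static void model_close(struct xrc_domain_model *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0)
        d->rcv_qp_alive = false;
}
```

In this model, if two processes open the domain and the one that allocated the QP closes first, rcv_qp_alive stays true until the second process closes as well, which is the cleanup-order property the RFC is after.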
> ======================================================================================
> /**
> * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a receive-side only QP,
> * and moves the created qp through the RESET->INIT and INIT->RTR transitions.
> * (The RTR->RTS transition is not needed, since this QP does no sending).
> * The sending XRC QP uses this QP as destination, while specifying an XRC SRQ
> * for actually receiving the transmissions and generating all completions on the
> * receiving side.
> *
> * This QP is created in kernel space, and persists until the XRC domain is closed.
> * (i.e., its reference count goes to zero).
> *
> * @pd: protection domain to use. At lower layer, this provides access to userspace obj
> * @xrc_domain: xrc domain to use for the QP.
> * @attr: modify-qp attributes needed to bring the QP to RTR.
> * @attr_mask: bitmap indicating which attributes are provided in the attr struct.
> * used for validity checking.
> * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to the remote node. The
> * remote node will use xrc_rcv_qpn in ibv_post_send when sending to
> * XRC SRQ's on this host in the same xrc domain.
> *
> * RETURNS: success (0), or a (negative) error value.
> */
> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
> struct ibv_xrc_domain *xrc_domain,
> struct ibv_qp_attr *attr,
> enum ibv_qp_attr_mask attr_mask,
> uint32_t *xrc_rcv_qpn);
> Notes:
> 1. Although the kernel creates the qp in the kernel's own PD, we still need the PD
> parameter to determine the device.
> 2. I chose to use struct ibv_qp_attr, which is used in modify QP, rather than create
> a new structure for this purpose. This also guards against API changes in the event
> that during development I notice that more modify-qp parameters must be specified
> for this operation to work.
> 3. Table of the ibv_qp_attr parameters showing what values to set:
> struct ibv_qp_attr {
>     enum ibv_qp_state   qp_state;             Not needed
>     enum ibv_qp_state   cur_qp_state;         Not needed
>                                               -- Driver starts from RESET and takes qp to RTR.
>     enum ibv_mtu        path_mtu;             Yes
>     enum ibv_mig_state  path_mig_state;       Yes
>     uint32_t            qkey;                 Yes
>     uint32_t            rq_psn;               Yes
>     uint32_t            sq_psn;               Not needed
>     uint32_t            dest_qp_num;          Yes -- this is the remote side QP for the RC conn.
>     int                 qp_access_flags;      Yes
>     struct ibv_qp_cap   cap;                  Need only XRC domain.
>                                               Other caps will use hard-coded values:
>                                                   max_send_wr = 1;
>                                                   max_recv_wr = 0;
>                                                   max_send_sge = 1;
>                                                   max_recv_sge = 0;
>                                                   max_inline_data = 0;
>     struct ibv_ah_attr  ah_attr;              Yes
>     struct ibv_ah_attr  alt_ah_attr;          Optional
>     uint16_t            pkey_index;           Yes
>     uint16_t            alt_pkey_index;       Optional
>     uint8_t             en_sqd_async_notify;  Not needed (No sq)
>     uint8_t             sq_draining;          Not needed (No sq)
>     uint8_t             max_rd_atomic;        Not needed (No sq)
>     uint8_t             max_dest_rd_atomic;   Yes -- Total max outstanding RDMAs expected
>                                               for ALL srq destinations using this receive QP.
>                                               (if you are only using SENDs, this value can be 0).
>     uint8_t             min_rnr_timer;        default - 0
>     uint8_t             port_num;             Yes
>     uint8_t             timeout;              Yes
>     uint8_t             retry_cnt;            Yes
>     uint8_t             rnr_retry;            Yes
>     uint8_t             alt_port_num;         Optional
>     uint8_t             alt_timeout;          Optional
> };
> 4. Attribute mask bits to set:
> For RESET_to_INIT transition:
> For INIT_to_RTR transition:
> If you are using RDMA or atomics, also set:
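
As a worked example for the max_dest_rd_atomic entry in the table above: because one receive QP serves every SRQ in the domain, the value is a total across all SRQ destinations rather than a per-SRQ figure. The sketch below simply sums a per-SRQ budget into the uint8_t field. The helper name and its parameters are hypothetical, and clamping the sum to a device limit is my assumption, not something the RFC specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: total the outstanding-RDMA budget over all SRQs that
 * share one receive QP. per_srq[i] is the number of outstanding RDMA
 * operations expected for SRQ i. The result is clamped to device_max
 * (assumption: the caller supplies whatever limit the HCA reports). */
static uint8_t total_dest_rd_atomic(const uint8_t *per_srq, size_t n,
                                    uint8_t device_max)
{
    unsigned total = 0;

    assert(per_srq != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        total += per_srq[i];
        if (total >= device_max)
            return device_max; /* stop early; also avoids overflow */
    }
    return (uint8_t)total;
}
```

With three SRQs budgeted at 4, 4 and 2 outstanding operations, the helper yields 10 when the limit is 16 and 8 when the limit is 8. For a SEND-only application every per-SRQ entry is 0, so the result is 0, matching the note in the table.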

Pavel Shamis (Pasha)
Mellanox Technologies