Open MPI Development Mailing List Archives

Subject: [OMPI devel] Re: Re: Re: doubt on latency result with OpenMPI library
From: Wang,Yanfei(SYS) (wangyanfei01_at_[hidden])
Date: 2014-03-27 23:45:41


1. With RoCE, we cannot use OOB (via a TCP socket) to set up the RDMA connection.
However, as far as I know, a Mellanox HCA that supports RoCE can run RDMA and TCP/IP simultaneously. Are there other HCAs that can only work in either RoCE mode or normal Ethernet mode at any one time, so that OMPI cannot use OOB (e.g., a TCP socket) to build the RDMA connection and has to use RDMA_CM instead?

I think that if OOB (e.g., TCP) can run simultaneously with RoCE, RDMA connection management would benefit from the TCP socket's scalability, right?

2. Scalability of RDMA_CM.
I previously had a few doubts about RDMA_CM's scalability when I looked closely at the source code of the RDMA_CM library and the corresponding kernel module. For example, the single shared QP1 used for connection requests and responses could introduce severe lock contention when a huge number of RDMA connections exist, as well as remote NUMA memory accesses on multi-core platforms. There are also many shared session-management data structures, which could cause additional contention.
However, if connections are not frequently destroyed and rebuilt, does scalability still depend heavily on RDMA_CM?
To understand UDCM better, I would like a deeper understanding of RDMA_CM's disadvantages.

This thread has helped me a lot with OMPI and RDMA transport settings, thanks!


From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Jeff Squyres (jsquyres)
Sent: March 28, 2014 0:58
To: Open MPI Developers
Subject: Re: [OMPI devel] Re: Re: doubt on latency result with OpenMPI library

On Mar 27, 2014, at 11:15 AM, "Wang,Yanfei(SYS)" <wangyanfei01_at_[hidden]> wrote:

> Normally we use rdma-cm to build the RDMA connection, then create QPs to do the RDMA data transmission, so what was the design consideration for separating rdma-cm connection setup from data transmission?

There's some history here...

Waaaay back in the day, the only way to make RC verbs connections over IB was to send QP numbers (and other info) out-of-band to a peer (e.g., via TCP sockets). OMPI implemented this method in the openib BTL.

This had some scalability issues, though, so we eventually started experimenting with some other mechanisms for making RC QPs. For example, we tried using the IB connection manager for a while (IBCM), but that ultimately got dropped.

The RDMA Connection Manager (RDMA CM) was always an option, but we never bothered to implement it in OMPI until other technologies came along that *required* the use of the RDMA CM, namely: iWARP and RoCE. Meaning: you *can't* make RC QPs over iWARP and RoCE via the OOB method, nor can you use the IB CM -- you *have* to use the RDMA CM.

RDMA CM has its own limitations, though. So for IB RC QPs -- where you don't *have* to use the RDMA CM -- we recently implemented the UDCM, which basically does the same thing as the initial OOB method, but in a more scalable and efficient fashion (I'm leaving out the details; let me know if you want to hear them).

So at different times, we've had different numbers of mechanisms in OMPI for making these connections. In the v1.7/v1.8 tree, I believe that the only 2 left are the RDMA CM and the UDCM.

I also believe that the RDMA CM will be chosen automatically for iWARP and RoCE, and the UDCM will be chosen automatically for IB.
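As a toy illustration of that default (a sketch of the rule as stated above, not the openib BTL's actual selection code), the mapping could be written like this. The function name and the transport labels are hypothetical; `rdmacm` and `udcm` are used here simply as short names for the two connection managers.

```python
# Assumed mapping, taken from the description above: iWARP and RoCE
# need the RDMA CM, while plain IB defaults to the UDCM.
_CPC_BY_TRANSPORT = {
    "iwarp": "rdmacm",
    "roce": "rdmacm",
    "ib": "udcm",
}


def choose_connection_manager(transport: str) -> str:
    """Return the connection manager this sketch would pick for a transport."""
    key = transport.strip().lower()
    if key not in _CPC_BY_TRANSPORT:
        raise ValueError(f"unknown transport: {transport!r}")
    return _CPC_BY_TRANSPORT[key]


if __name__ == "__main__":
    for name in ("IB", "RoCE", "iWARP"):
        print(name, "->", choose_connection_manager(name))
```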

So after all that: I think you shouldn't need to specify the connection manager MCA parameter at all; the openib BTL should choose the Right one for you.

Jeff Squyres