Open MPI Development Mailing List Archives

Subject: [OMPI devel] Re: Re: Re: Re: doubt on latency result with OpenMPI library
From: Wang,Yanfei(SYS) (wangyanfei01_at_[hidden])
Date: 2014-03-28 21:07:03

Thanks Jeff!

It's very helpful. I will read all the responses in this thread again to understand your points more deeply.


From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Jeff Squyres (jsquyres)
Sent: March 28, 2014 19:18
To: Open MPI Developers
Subject: Re: [OMPI devel] Re: Re: Re: doubt on latency result with OpenMPI library

On Mar 27, 2014, at 11:45 PM, "Wang,Yanfei(SYS)" <wangyanfei01_at_[hidden]> wrote:

> 1. In the RoCE, we cannot use OOB(via tcp socket) for RDMA connection.

More specifically, RoCE QPs can only be made using the RDMA connection manager.

> However, as far as I know, a Mellanox HCA supporting RoCE can make RDMA and
> TCP/IP work simultaneously. Is it that some other HCAs can only work in
> RoCE mode or normal Ethernet mode individually,

FYI: Mellanox is the only RoCE vendor.

> so that OMPI cannot use OOB (like a TCP socket) to build the RDMA connection, only RDMA_CM?

You're mixing two different things: having the ability to run an OS IP stack over a RoCE-capable NIC is orthogonal to whether you can use some out-of-band method to make RoCE RC QPs.

I think you're misunderstanding what OMPI's "oob" QP connection mechanism did. Here's what it did:

1. MPI processes A and B (on different servers) would each create half a QP.
2. They would then extract the QP connection information from the half-created QP data structures (e.g., the unique QP number) -- A would extract Aa and B would extract Bb.
3. A and B would exchange this information.
4. A would use Bb to finish creating its QP, and B would use Aa to finish creating its QP. This is a LOCAL operation -- it's effectively just filling in some data structures.
5. Now A and B have fully formed QPs and can use them to send/receive to each other.

The fact that #3 used TCP sockets to exchange information is irrelevant -- you could very well have printed out that information on a screen and hand-typed the information in at the peer.

The only important aspect is that the information had to be exchanged. It doesn't matter whether you use TCP sockets or the actual RDMA CM.
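
To make the five steps concrete, here is a toy Python sketch of the exchange. It is only a model under stated assumptions: ToyQP, ConnInfo, establish, and retype_by_hand are hypothetical names invented for illustration, not the libibverbs API or OMPI code, and the two fields in ConnInfo merely stand in for whatever connection information a peer needs. The channel is an arbitrary function, which is the point of step 3: any transport that delivers the information will do.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ConnInfo:
    """Hypothetical stand-in for what the peer must learn (e.g. the QP number)."""
    qp_num: int
    lid: int


@dataclass
class ToyQP:
    """Toy model of a half-created RC QP; NOT the real verbs data structure."""
    local: ConnInfo
    remote: Optional[ConnInfo] = None

    def connect(self, peer_info: ConnInfo) -> None:
        # Step 4: a purely local operation that fills in a data structure.
        self.remote = peer_info

    @property
    def ready(self) -> bool:
        # Step 5: the QP is usable once the peer's information is filled in.
        return self.remote is not None


def establish(a: ToyQP, b: ToyQP,
              channel: Callable[[ConnInfo], ConnInfo]) -> None:
    # Steps 2-3: extract each side's info and ship it over ANY channel.
    info_a = channel(a.local)
    info_b = channel(b.local)
    a.connect(info_b)
    b.connect(info_a)


def retype_by_hand(info: ConnInfo) -> ConnInfo:
    # Simulates printing the values on a screen and typing them in at the peer.
    text = f"{info.qp_num},{info.lid}"
    qp_num, lid = (int(field) for field in text.split(","))
    return ConnInfo(qp_num, lid)


# Step 1: each process creates its half of the QP.
qp_a = ToyQP(ConnInfo(qp_num=0x11, lid=1))
qp_b = ToyQP(ConnInfo(qp_num=0x22, lid=2))
establish(qp_a, qp_b, retype_by_hand)
```

Swapping retype_by_hand for a function that sends the values over a TCP socket or a UD QP would not change establish at all. In real verbs code, step 4 corresponds to the ibv_modify_qp() calls that move the QP toward the ready-to-send state. For RoCE this model breaks down, because not all of the required information is visible to userspace.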

*** Also keep in mind that OMPI's "oob" connection method for IB RC QPs in the openib BTL has been deleted, and has been wholly replaced with the "udcm" connection method (which uses UD QPs for #3, which act very much like UDP datagrams).

For IB, this method of "exchange critical connection information via an out-of-band method" works fine. For RoCE, it's not possible -- there's additional, kernel-level (and possibly hardware-level? I don't know/remember offhand) information that cannot be extracted by userspace and exchanged via an out-of-band method. Hence, you HAVE to use the RDMA CM to make RoCE QPs.

Let me make this totally clear: the fact that you have to use the RDMA CM to make RoCE RC QPs is not an OMPI choice. It's mandated by how the RoCE technology works. IB technology allows the "workaround" of extracting the necessary connection information such that we can use our "udcm" and not RDMA CM.

> I think if OOB (like TCP) can run simultaneously with RoCE, the RDMA connection management would benefit from the TCP socket's scalability, right?
> 2. Scalability of RDMA_CM.
> Previously I also had a few doubts about RDMA_CM's scalability when I looked deep into the source code of the RDMA_CM library and the corresponding kernel module, e.g., the single shared QP1 for connection requests and responses, which could introduce severe lock contention if a huge number of RDMA connections exist, and remote NUMA memory access on multi-core platforms; also lots of shared session management data structures, which could cause additional contention.
> However, if the connections are not frequently destroyed and rebuilt, does the scalability still depend heavily on RDMA_CM?
> To better understand UDCM, I would like a deep understanding of RDMA_CM's disadvantages.

You'll have to ask Mellanox / the OpenFabrics community for insights about the RDMA CM. To OMPI, that's the lower layer and we're just a consumer of it.

Keep in mind that the CM is only used during QP connection establishment -- it's not used after that. So if it's a little less efficient, it usually doesn't matter (if it's a LOT less efficient, then it does matter, of course).

Jeff Squyres