Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] infiniband with MPI
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-07-31 10:27:28

On Jul 31, 2012, at 12:14 AM, Joen Chen wrote:

> After reading the FAQ about OFED, I learned that Open MPI can work with RoCE.

Correct -- Open MPI can use RoCE interfaces, if they are available.

> Moreover, using RoCE adds some overhead because of the underlying network layers. In my InfiniBand bandwidth testing, I get 5Gbps using IPoIB and 12Gbps using RDMA. The performance gap is huge for my application.

I'm not sure how you get 12Gbps -- RoCE interfaces are 10Gbps per port, aren't they? :-)

> My question is: Can Open MPI use the raw RDMA API directly, rather than going through the network layer?

I think you're mixing terminology here.

- IPoIB: an emulated TCP/IP stack over OpenFabrics devices (to include InfiniBand and RoCE devices). Emulating the TCP/IP stack leads to quite a bit of overhead (compared to raw verbs mode), and results in significantly lower performance. Open MPI will use the TCP BTL to access such devices.

- Verbs: the "raw" or "native" API to access OpenFabrics devices (to include InfiniBand and RoCE devices). This mode is much higher performance than IPoIB. Open MPI will use the "openib" BTL to access such devices.

For a RoCE device, you can have it configured to simultaneously support both IPoIB and Verbs modes. Hence, you can have some applications using IPoIB and others using Verbs.

Open MPI will prefer using the verbs mode (and will ignore the IPoIB mode if the verbs mode is enabled).

Hence, if you use Open MPI 1.6.x, it should automatically default to using the openib BTL over your RoCE devices, which should result in higher performance than the IPoIB mode, or even native TCP/IP mode. (IIRC -- but I'm not 100% sure of this -- since RoCE devices are Ethernet devices, they can run Linux's native TCP/IP stack, too, not just the IPoIB stack.)
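If you want to compare the two paths explicitly rather than rely on the default, you can name the BTLs on the mpirun command line. Here is a minimal sketch that just builds the two command lines. The "--mca btl <list>" option is the usual way to select BTLs in the 1.6 series; the helper name mpirun_command and the application ./my_mpi_app are made up for illustration, and your RoCE setup may need additional MCA parameters (check the FAQ).

```python
import shlex


def mpirun_command(btls, nprocs, app, app_args=()):
    """Build an mpirun argv list that restricts Open MPI to the given BTLs.

    btls    -- BTL component names, e.g. ["openib", "self", "sm"]
    nprocs  -- number of MPI processes to launch
    app     -- path to the MPI executable (hypothetical in this sketch)
    """
    cmd = ["mpirun", "--mca", "btl", ",".join(btls), "-np", str(nprocs), app]
    cmd.extend(app_args)
    return cmd


# Verbs path: openib BTL, plus self (loopback) and sm (shared memory).
verbs_cmd = mpirun_command(["openib", "self", "sm"], 2, "./my_mpi_app")

# TCP path: the tcp BTL instead of openib, for comparison.
tcp_cmd = mpirun_command(["tcp", "self", "sm"], 2, "./my_mpi_app")

for cmd in (verbs_cmd, tcp_cmd):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running the same benchmark once with each command line should show which transport a given number actually came from.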

Can you explain how you got your 5Gbps and 12Gbps numbers? I'm already suspicious of your testing methodology because 12Gbps over a single RoCE port just isn't possible... :-)
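For reference, here is a small sketch of the arithmetic behind such a sanity check. It assumes the 10Gbps-per-port figure mentioned above and that 1 Gbps means 1e9 bits per second; the function names and the sample byte count are hypothetical, not taken from your test.

```python
LINE_RATE_GBPS = 10.0  # per-port RoCE rate assumed in this thread


def throughput_gbps(bytes_transferred, seconds):
    """Convert a measured transfer into gigabits per second (1 Gbps = 1e9 bit/s)."""
    return bytes_transferred * 8 / seconds / 1e9


def exceeds_line_rate(gbps, line_rate=LINE_RATE_GBPS):
    """True if a reported rate is higher than a single port can carry."""
    return gbps > line_rate


# Hypothetical measurement: 1.5e9 bytes moved in one second.
measured = throughput_gbps(1_500_000_000, 1.0)
print(measured, exceeds_line_rate(measured))
```

A result above the line rate usually points to a units mix-up, more than one port or link in use, or a timing problem in the benchmark.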

Jeff Squyres