Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem running with RoCE over 10GbE
From: Yevgeny Kliteynik (kliteyn_at_[hidden])
Date: 2011-10-05 09:04:23


Jeff,

On 01-Oct-11 1:01 AM, Konz, Jeffrey (SSA Solution Centers) wrote:
> Encountered a problem when trying to run OpenMPI 1.5.4 with RoCE over 10GbE fabric.
>
> Got this run time error:
>
> An invalid CPC name was specified via the btl_openib_cpc_include MCA
> parameter.
>
> Local host: atl3-14
> btl_openib_cpc_include value: rdmacm
> Invalid name: rdmacm
> All possible valid names: oob,xoob
> --------------------------------------------------------------------------
> [atl3-14:07184] mca: base: components_open: component btl / openib open function failed
> [atl3-12:09178] mca: base: components_open: component btl / openib open function failed
>
> Used these options to mpirun:
> "--mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm -mca btl_openib_if_include mlx4_0:2"
>
> We have a Mellanox LOM with two ports: the first is an IB port, the second is a 10GbE port.
> Running over the IB port and running TCP over the 10GbE port both work fine.
>
> Built OpenMPI with this option "--enable-openib-rdmacm".
> Our system has OFED 1.5.2 with librdmacm-1.0.13-1
>
> I noticed this output from configure script:
> checking rdma/rdma_cma.h usability... no
> checking rdma/rdma_cma.h presence... no
> checking for rdma/rdma_cma.h... no
> checking whether IBV_LINK_LAYER_ETHERNET is declared... yes
> checking if RDMAoE support is enabled... yes
> checking for infiniband/driver.h... yes
> checking if ConnectX XRC support is enabled... yes
> checking if dynamic SL is enabled... no
> checking if OpenFabrics RDMACM support is enabled... no
>
> Are we missing a build option or a piece of software?
> Config.log and output from "ompi_info --all" attached.

You don't need the "--enable-openib-rdmacm" option - rdmacm
support is enabled by default, provided that librdmacm is found on
the build machine.

So the question is why the OMPI configure script didn't find it.
OMPI looks for the "rdma/rdma_cma.h" header. Do you have it on
your build machine? The usual location of this file is
/usr/include/rdma/rdma_cma.h.
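
If you want to check this by hand, here is a minimal standalone sketch of
the test that configure runs (see the m4 fragment below). The file name and
the compile line are only examples, not anything OMPI ships; but if something
like "cc rdmacm_check.c -lrdmacm" compiles and links cleanly, then both the
header and librdmacm are visible on your build machine:

/* rdmacm_check.c - standalone sketch of OMPI's configure test.
 * Example only: the file name and the compile command
 * "cc rdmacm_check.c -lrdmacm" are illustrative, not part of OMPI. */
#include <rdma/rdma_cma.h>   /* the header configure looks for */

int main(void)
{
    /* Same statement the configure test compiles and links; nothing
     * is dereferenced here, we only need the symbol to resolve. */
    void *ret = (void *) rdma_get_peer_addr((struct rdma_cm_id *) 0);
    (void) ret;
    return 0;
}

If this doesn't compile, the librdmacm development package (the one that
installs /usr/include/rdma/rdma_cma.h) is probably missing.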

Another possible reason: it appears that OMPI's configure test
includes "rdma/rdma_cma.h" rather than <rdma/rdma_cma.h>.

Please apply the following tiny fix to OMPI source:

Index: ompi/config/ompi_check_openib.m4
===================================================================
--- ompi/config/ompi_check_openib.m4 (revision 25228)
+++ ompi/config/ompi_check_openib.m4 (working copy)
@@ -207,7 +207,7 @@
                      [AC_CHECK_LIB([rdmacm], [rdma_create_id],
                          [AC_MSG_CHECKING([for rdma_get_peer_addr])
                          $1_msg=no
-                          AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include "rdma/rdma_cma.h"
+                          AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <rdma/rdma_cma.h>
                                  ]], [[void *ret = (void*) rdma_get_peer_addr((struct rdma_cm_id*)0);]])],
                              [$1_have_rdmacm=1
                              $1_msg=yes])

Then re-run autogen.sh and configure, and check whether rdmacm is found.

-- YK

> % ibv_devinfo
> hca_id: mlx4_0
>     transport:                  InfiniBand (0)
>     fw_ver:                     2.9.1000
>     node_guid:                  78e7:d103:0021:4464
>     sys_image_guid:             78e7:d103:0021:4467
>     vendor_id:                  0x02c9
>     vendor_part_id:             26438
>     hw_ver:                     0xB0
>     board_id:                   HP_0200000003
>     phys_port_cnt:              2
>         port:   1
>             state:              PORT_ACTIVE (4)
>             max_mtu:            2048 (4)
>             active_mtu:         2048 (4)
>             sm_lid:             34
>             port_lid:           11
>             port_lmc:           0x00
>             link_layer:         IB
>
>         port:   2
>             state:              PORT_ACTIVE (4)
>             max_mtu:            2048 (4)
>             active_mtu:         1024 (3)
>             sm_lid:             0
>             port_lid:           0
>             port_lmc:           0x00
>             link_layer:         Ethernet
>
> % /sbin/ifconfig
> eth0      Link encap:Ethernet  HWaddr 78:E7:D1:21:44:60
>           inet addr:16.113.180.147  Bcast:16.113.183.255  Mask:255.255.252.0
>           inet6 addr: fe80::7ae7:d1ff:fe21:4460/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1861763 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1776402 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:712448939 (679.4 MiB)  TX bytes:994111004 (948.0 MiB)
>           Memory:fb9e0000-fba00000
>
> eth2      Link encap:Ethernet  HWaddr 78:E7:D1:21:44:65
>           inet addr:10.10.0.147  Bcast:10.10.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::78e7:d100:121:4465/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:8519814 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:8555715 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:12370127778 (11.5 GiB)  TX bytes:12372246315 (11.5 GiB)
>
> ib0       Link encap:InfiniBand  HWaddr 80:00:00:4D:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:192.168.0.147  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::7ae7:d103:21:4465/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:16384  Metric:1
>           RX packets:1989 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:208 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:275196 (268.7 KiB)  TX bytes:19202 (18.7 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:42224 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:42224 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:3115668 (2.9 MiB)  TX bytes:3115668 (2.9 MiB)
>
> Thanks,
>
> -Jeff
>
>
> /**********************************************************/
> /* Jeff Konz jeffrey.konz_at_[hidden] */
> /* Solutions Architect HPC Benchmarking */
> /* Americas Shared Solutions Architecture (SSA) */
> /* Hewlett-Packard Company */
> /* Office: 248-491-7480 Mobile: 248-345-6857 */
> /**********************************************************/
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users