Open MPI User's Mailing List Archives


Subject: [OMPI users] Using Service Levels (SLs) with OpenMPI 1.6.4 + MLNX_OFED 2.0
From: Jesús Escudero Sahuquillo (jescudero_at_[hidden])
Date: 2013-06-11 08:02:21


I have a 16-node Mellanox cluster built with Mellanox ConnectX-3 cards.
Recently I have updated the MLNX_OFED stack to version 2.0.5. The reason
for this e-mail to the Open MPI users list is that I am unable to run
MPI applications using the service levels (SLs) feature of the openib
BTL.

Currently, the nodes run Red Hat 6.4 with kernel
2.6.32-358.el6.x86_64. I have compiled Open MPI 1.6.4 with:

  ./configure --with-sge --with-openib=/usr --enable-openib-connectx-xrc
--enable-mpi-thread-multiple --with-threads --with-hwloc
--enable-heterogeneous --with-fca=/opt/mellanox/fca
--with-mxm-libdir=/opt/mellanox/mxm/lib --with-mxm=/opt/mellanox/mxm
--prefix=/home/jescudero/opt/openmpi
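
For completeness, a quick way to confirm that the openib BTL and the
mxm MTL were actually built (generic ompi_info usage, nothing specific
to this setup) is:

ompi_info | grep -E "btl: openib|mtl: mxm"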

I have modified the OpenSM code (which is based on 3.3.15) in order to
include a special routing algorithm based on "ftree". Everything appears
to be correct on the OpenSM side, since it returns the SLs when I execute
"saquery --src-to-dst slid:dlid". I have also tried running OpenSM with
the DFSSSP algorithm.
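
For reference, with a stock OpenSM 3.3.x the routing engine and QoS
support are selected on the command line; a minimal sketch using only
the standard options (not my modified build):

opensm -R dfsssp -Q              # -R selects the routing engine, -Q enables QoS
saquery --src-to-dst slid:dlid   # should return the PathRecord, including the SL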

However, when I try to run MPI applications (e.g. HPCC, the OSU
benchmarks, or even alltoall.c, included in the Open MPI sources) I get
errors if "btl_openib_ib_path_record_service_level" is set to "1";
otherwise (i.e. if that parameter is left disabled) the application
runs to completion correctly. I run the MPI applications with the
following command:

mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux
--mca btl openib,self,sm --mca mtl mxm --mca
btl_openib_ib_path_record_service_level 1 --mca btl_openib_cpc_include
oob hpcc
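
As far as I know, the same MCA parameters can equivalently be placed in
$HOME/.openmpi/mca-params.conf instead of on the command line; a minimal
sketch:

btl = openib,self,sm
mtl = mxm
btl_openib_ib_path_record_service_level = 1
btl_openib_cpc_include = oob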

I obtain the following trace:

[nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
error posting receive on QP [0x16db] errno says: Success [0]
[nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
error posting receive on QP [0x1749] errno says: Success [0]
[nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
error posting receive on QP [0x1783] errno says: Success [0]
[nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
error posting receive on QP [0x1838] errno says: Success [0]
[nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
endpoint connect error: -1
[nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
endpoint connect error: -1
[nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
endpoint connect error: -1
[nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
endpoint connect error: -1

Does anyone know what I am doing wrong?

All the best,