Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Using Service Levels (SLs) with OpenMPI 1.6.4 + MLNX_OFED 2.0
From: Mike Dubman (miked_at_[hidden])
Date: 2013-06-11 09:32:31


The --mca btl_openib_ib_path_record_service_level 1 flag applies to the
openib BTL, so you need to remove --mca mtl mxm from the command line.
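
As a rough sketch, the adjusted invocation could look like the following,
assuming every other option from the quoted command stays unchanged. The
variable name MPIRUN_ARGS is only illustrative, and the script prints the
command for review instead of launching it.

```shell
# Arguments from the quoted invocation, minus "--mca mtl mxm", so that
# the openib BTL (which the SL parameter applies to) carries the traffic.
# MPIRUN_ARGS is a hypothetical helper variable, not an Open MPI setting.
MPIRUN_ARGS="-display-allocation -display-map -np 8 -machinefile maquinas.aux \
--mca btl openib,self,sm \
--mca btl_openib_ib_path_record_service_level 1 \
--mca btl_openib_cpc_include oob hpcc"

# Print the full command for review; drop the "echo" to actually launch it.
echo mpirun $MPIRUN_ARGS
```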

Have you compiled OpenMPI against the RHEL 6.4 inbox OFED driver? AFAIK,
MOFED 2.x does not have XRC, yet you mentioned the
"--enable-openib-connectx-xrc" flag in your configure line.

On Tue, Jun 11, 2013 at 3:02 PM, Jesús Escudero Sahuquillo <
jescudero_at_[hidden]> wrote:

> I have a 16-node Mellanox cluster built with Mellanox ConnectX3 cards.
> Recently I updated MLNX_OFED to version 2.0.5. I am writing to the
> OpenMPI users list because I am not able to run MPI applications using
> the service levels (SLs) feature of the OpenMPI driver.
>
> Currently, the nodes run Red Hat 6.4 with kernel
> 2.6.32-358.el6.x86_64. I have compiled OpenMPI 1.6.4 with:
>
> ./configure --with-sge --with-openib=/usr --enable-openib-connectx-xrc
> --enable-mpi-thread-multiple --with-threads --with-hwloc
> --enable-heterogeneous --with-fca=/opt/mellanox/fca
> --with-mxm-libdir=/opt/mellanox/mxm/lib --with-mxm=/opt/mellanox/mxm
> --prefix=/home/jescudero/opt/openmpi
>
> I have modified the OpenSM code (which is based on 3.3.15) in order to
> include a special routing algorithm based on "ftree". Apparently all is
> correct with the OpenSM since it returns the SLs when I execute the command
> "saquery --src-to-dst slid:dlid". Anyway, I have also tried to run the
> OpenSM with the DFSSSP algorithm.
>
> However, when I try to run MPI applications (e.g. HPCC, OSU or even
> alltoall.c, which is included in the OpenMPI sources) I get errors if
> "btl_openib_path_record_info" is set to "1". Otherwise (i.e. if
> btl_openib_path_record_info is not enabled) the application runs to
> completion correctly. I run the MPI application with the following command:
>
> mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux
> --mca btl openib,self,sm --mca mtl mxm
> --mca btl_openib_ib_path_record_service_level 1
> --mca btl_openib_cpc_include oob hpcc
>
> I obtain the following trace:
>
> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x16db] errno says: Success [0]
> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1749] errno says: Success [0]
> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1783] errno says: Success [0]
> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1838] errno says: Success [0]
> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
>
> Does anyone know what I am doing wrong?
>
> All the best,