Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Using Service Levels (SLs) with OpenMPI 1.6.4 + MLNX_OFED 2.0
From: Jesús Escudero Sahuquillo (jescudero_at_[hidden])
Date: 2013-06-11 10:40:27


In fact, I have also tried configuring Open MPI with this:

./configure --with-sge --with-openib --enable-mpi-thread-multiple
--with-threads --with-hwloc --enable-heterogeneous --disable-vt
--enable-openib-dynamic-sl --prefix=/home/jescudero/opt/openmpi

The problem is still present.
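
For what it is worth, a quick way to check whether the rebuilt library
actually exposes the SL-related openib parameter is something like the
following (assuming the ompi_info from the prefix above is the one being
picked up):

  /home/jescudero/opt/openmpi/bin/ompi_info --param btl openib | grep -i path_record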

On 11/06/13 15:32, Mike Dubman wrote:
> The --mca btl_openib_ib_path_record_service_level 1 flag controls the
> openib BTL; you need to remove --mca mtl mxm from the command line.
>
> Have you compiled Open MPI with the RHEL 6.4 inbox OFED driver? AFAIK,
> MOFED 2.x does not have XRC, and you mentioned the
> "--enable-openib-connectx-xrc" flag in your configure line.
>
>
> On Tue, Jun 11, 2013 at 3:02 PM, Jesús Escudero Sahuquillo
> <jescudero_at_[hidden]> wrote:
>
> I have a 16-node Mellanox cluster built with Mellanox ConnectX-3
> cards. I have recently updated MLNX_OFED to version 2.0.5.
> The reason for this e-mail to the Open MPI users list is that I am
> not able to run MPI applications using the service level (SL)
> feature of Open MPI's openib BTL.
>
> Currently, the nodes run Red Hat 6.4 with kernel
> 2.6.32-358.el6.x86_64. I have compiled Open MPI 1.6.4 with:
>
> ./configure --with-sge --with-openib=/usr
> --enable-openib-connectx-xrc --enable-mpi-thread-multiple
> --with-threads --with-hwloc --enable-heterogeneous
> --with-fca=/opt/mellanox/fca
> --with-mxm-libdir=/opt/mellanox/mxm/lib
> --with-mxm=/opt/mellanox/mxm --prefix=/home/jescudero/opt/openmpi
>
> I have modified the OpenSM code (based on version 3.3.15) to include
> a special routing algorithm based on "ftree". Apparently everything
> is correct with OpenSM, since it returns the SLs when I execute the
> command "saquery --src-to-dst slid:dlid". I have also tried running
> OpenSM with the DFSSSP algorithm.
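>
> For example, a query like the following (the LIDs 2 and 8 are just
> placeholders; I take the real source and destination LIDs from ibstat
> on the nodes) returns the SLs as expected:
>
>   saquery --src-to-dst 2:8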
>
> However, when I try to run MPI applications (e.g. HPCC, the OSU
> benchmarks, or even alltoall.c, included in the Open MPI sources), I
> get errors if "btl_openib_ib_path_record_service_level" is set to
> "1"; otherwise (i.e. if that parameter is not enabled) the
> application runs to completion. I run the MPI application with the
> following command:
>
> mpirun -display-allocation -display-map -np 8 -machinefile
> maquinas.aux --mca btl openib,self,sm --mca mtl mxm --mca
> btl_openib_ib_path_record_service_level 1 --mca
> btl_openib_cpc_include oob hpcc
>
> I obtain the following trace:
>
> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x16db] errno says: Success [0]
> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1749] errno says: Success [0]
> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1783] errno says: Success [0]
> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1838] errno says: Success [0]
> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
>
> Does anyone know what I am doing wrong?
>
> All the best,
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users