Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Using Service Levels (SLs) with OpenMPI 1.6.4 + MLNX_OFED 2.0
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-06-11 10:58:09


A couple of things stand out. You should remove the following configure options:

--enable-mpi-thread-multiple
--with-threads
--enable-heterogeneous

Thread multiple is not ready yet in OMPI (and the openib BTL doesn't support threaded operation anyway), and the support for heterogeneous systems really isn't working. I'm not saying that's the sole source of the problem, but it may well be contributing: if you are running a multi-threaded app, those options expose alternative code paths that may not be fully debugged.
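For reference, here is the configure line from the quoted message with those three options dropped; this is only a sketch, keeping all the other flags and the install prefix exactly as in the original report:

```shell
# Original configure invocation minus --enable-mpi-thread-multiple,
# --with-threads and --enable-heterogeneous (paths are the poster's):
./configure --with-sge --with-openib --with-hwloc --disable-vt \
    --enable-openib-dynamic-sl \
    --prefix=/home/jescudero/opt/openmpi
```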

On Jun 11, 2013, at 7:40 AM, Jesús Escudero Sahuquillo <jescudero_at_[hidden]> wrote:

> In fact, I have also tried configuring Open MPI with this:
>
> ./configure --with-sge --with-openib --enable-mpi-thread-multiple --with-threads --with-hwloc --enable-heterogeneous --disable-vt --enable-openib-dynamic-sl --prefix=/home/jescudero/opt/openmpi
>
> And the problem is still present.
>
> On 11/06/13 15:32, Mike Dubman wrote:
>> The --mca btl_openib_ib_path_record_service_level 1 flag controls the openib BTL; you need to remove --mca mtl mxm from the command line.
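Applying that advice, the run command from the report below would look along these lines (a sketch only; hosts, process count and the remaining MCA flags are kept from the original):

```shell
# Original mpirun line with "--mca mtl mxm" removed, so the openib BTL
# (and its path-record SL lookup) actually carries the traffic:
mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux \
    --mca btl openib,self,sm \
    --mca btl_openib_ib_path_record_service_level 1 \
    --mca btl_openib_cpc_include oob hpcc
```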
>>
>> Have you compiled Open MPI against the RHEL 6.4 inbox OFED driver? AFAIK, MLNX_OFED 2.x does not have XRC, and you mentioned the "--enable-openib-connectx-xrc" flag in your configure line.
>>
>>
>> On Tue, Jun 11, 2013 at 3:02 PM, Jesús Escudero Sahuquillo <jescudero_at_[hidden]> wrote:
>> I have a 16-node Mellanox cluster built with Mellanox ConnectX-3 cards. Recently I updated MLNX_OFED to version 2.0.5. The reason for this e-mail to the Open MPI users list is that I am not able to run MPI applications using the service-level (SL) feature of the Open MPI openib driver.
>>
>> Currently, the nodes run Red Hat 6.4 with kernel 2.6.32-358.el6.x86_64. I have compiled Open MPI 1.6.4 with:
>>
>> ./configure --with-sge --with-openib=/usr --enable-openib-connectx-xrc --enable-mpi-thread-multiple --with-threads --with-hwloc --enable-heterogeneous --with-fca=/opt/mellanox/fca --with-mxm-libdir=/opt/mellanox/mxm/lib --with-mxm=/opt/mellanox/mxm --prefix=/home/jescudero/opt/openmpi
>>
>> I have modified the OpenSM code (based on 3.3.15) to include a special routing algorithm based on "ftree". Apparently everything is correct with the OpenSM, since it returns the SLs when I execute the command "saquery --src-to-dst slid:dlid". Anyway, I have also tried running the OpenSM with the DFSSSP algorithm.
>>
>> However, when I try to run MPI applications (e.g. HPCC, the OSU benchmarks, or even alltoall.c, included in the Open MPI sources), I get errors if "btl_openib_ib_path_record_service_level" is set to "1"; otherwise (i.e. if that parameter is not enabled) the application execution ends correctly. I run the MPI application with the following command:
>>
>> mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux --mca btl openib,self,sm --mca mtl mxm --mca btl_openib_ib_path_record_service_level 1 --mca btl_openib_cpc_include oob hpcc
>>
>> I obtain the following trace:
>>
>> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_sl.c:239:get_pathrecord_info] error posting receive on QP [0x16db] errno says: Success [0]
>> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_sl.c:239:get_pathrecord_info] error posting receive on QP [0x1749] errno says: Success [0]
>> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_sl.c:239:get_pathrecord_info] error posting receive on QP [0x1783] errno says: Success [0]
>> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_sl.c:239:get_pathrecord_info] error posting receive on QP [0x1838] errno says: Success [0]
>> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_oob.c:885:rml_recv_cb] endpoint connect error: -1
>> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_oob.c:885:rml_recv_cb] endpoint connect error: -1
>> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_oob.c:885:rml_recv_cb] endpoint connect error: -1
>> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_oob.c:885:rml_recv_cb] endpoint connect error: -1
>>
>> Does anyone know what I am doing wrong?
>>
>> All the best,
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users