Open MPI User's Mailing List Archives

Subject: [OMPI users] Issue with QLogic IBA7322 QDR InfiniBand HCA
From: Fabricio Cannini (fcannini_at_[hidden])
Date: 2014-03-31 22:06:31


Hi there

I'm facing a strange issue with this HCA. A cluster I support was
recently expanded with 4 new nodes, all using the mentioned HCA. 3 nodes
are working fine, but one will not use the IB network when running jobs.
Let's call the working one 'node a' and the non-working one 'node b'.
Here's my scenario:

OS: Rocks Linux 6.1 (CentOS 6.5 x86_64)

MPI: Stock CentOS rpm. 'ompi_info' output below:
package:Open MPI mockbuild_at_[hidden] Distribution
ompi:version:full:1.5.4
ompi:version:svn:r25060
ompi:version:release_date:Aug 18, 2011
orte:version:full:1.5.4
orte:version:svn:r25060
orte:version:release_date:Aug 18, 2011
opal:version:full:1.5.4
opal:version:svn:r25060
opal:version:release_date:Aug 18, 2011
ident:1.5.4

PATH:
/usr/lib64/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin:/root/bin

LD_LIBRARY_PATH: /usr/lib64/openmpi/lib

OpenFabrics: Stock CentOS rpm
libibumad-1.3.8-1.el6.x86_64
libibmad-1.3.9-1.el6.x86_64
libibverbs-utils-1.1.7-1.el6.x86_64
libibverbs-1.1.7-1.el6.x86_64
librdmacm-1.0.17-1.el6.x86_64
infinipath-psm-3.0.1-115.1015_open.2.el6.x86_64

ulimit -l:
'unlimited' on both nodes

Here's where things get interesting. On all nodes with the QLogic HCA,
'ibv_devinfo' does not output what is expected, only:
        "libibverbs: Warning: no userspace device-specific driver found
          for /sys/class/infiniband_verbs/uverbs0
        No IB devices found"
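
This warning generally means libibverbs found the kernel device but no
matching userspace provider. A minimal sketch for listing the providers
that are registered on a node, assuming the libibverbs 1.1.x convention
of '*.driver' files under /etc/libibverbs.d, each holding a line such as
"driver <name>". The function name is hypothetical, and the expectation
that a QLogic HCA needs an 'ipath' entry (normally shipped by the
libipathverbs package, which is not in the package list above) is an
assumption to verify, not a confirmed diagnosis:

```shell
#!/bin/sh
# Print the provider names registered with libibverbs.
# Assumption: providers are declared in <dir>/*.driver files,
# one "driver <name>" line per file (libibverbs 1.1.x layout).
list_verbs_drivers() {
    dir="${1:-/etc/libibverbs.d}"
    for f in "$dir"/*.driver; do
        # Skip the unexpanded glob when the directory has no .driver files.
        [ -e "$f" ] || continue
        awk '$1 == "driver" { print $2 }' "$f"
    done
}
```

Running 'list_verbs_drivers' with no argument on both 'node a' and
'node b' and comparing the output would show whether the two nodes
register the same providers.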

But I've successfully run tests on 'node a', like IMB ping and hello
world, from other working nodes of the cluster, so despite the output of
'ibv_devinfo', the 'node a' HCA is working.

I can run 'hello world' from 'node b' to 'node a' without problems,
but the opposite does not work.

So this is my question: why is only the 'node b' HCA not working?
Are there any other tests I can run to get closer to the source of the
problem?
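
Since the two nodes are supposed to be identical, one possible first
test is to compare their installed packages. A rough sketch, assuming
the output of 'rpm -qa' from each node has been saved to a file; the
function name and the file names in the usage note are hypothetical:

```shell
#!/bin/sh
# Print packages that appear in only one of two package lists.
# Arguments: two files, each holding `rpm -qa` output from one node.
diff_pkg_lists() {
    a_sorted=$(mktemp) || return 1
    b_sorted=$(mktemp) || return 1
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    # comm -3 suppresses lines common to both files:
    # unindented lines are only in the first file,
    # tab-indented lines are only in the second.
    comm -3 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}
```

For example, after 'ssh node-a rpm -qa > a.txt' and
'ssh node-b rpm -qa > b.txt', calling 'diff_pkg_lists a.txt b.txt'
lists whatever differs; empty output means the package sets match and
the difference lies elsewhere (kernel modules, firmware, cabling).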

TIA
Fabricio