Open MPI User's Mailing List Archives

Subject: [OMPI users] Issue with QLogic IBA7322 QDR InfiniBand HCA
From: Fabricio Cannini (fcannini_at_[hidden])
Date: 2014-03-31 22:06:31

Hi there

I'm facing a strange issue with this HCA. A cluster I support was
recently expanded with 4 new nodes, all using the HCA mentioned above.
3 nodes are working fine, but one will not use the IB network when
running jobs. Let's call the working node 'node a' and the non-working
one 'node b'. Here's my scenario:

OS: Rocks Linux 6.1 (CentOS 6.5 x86_64)

MPI: Stock CentOS RPM. 'ompi_info' output below:
package:Open MPI mockbuild_at_[hidden] Distribution
ompi:version:release_date:Aug 18, 2011
orte:version:release_date:Aug 18, 2011
opal:version:release_date:Aug 18, 2011
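
Since the symptom differs between two nodes that should be identical, one way to compare their 'ompi_info' output is to parse the colon-separated lines into key/value pairs and diff them. The following is only a minimal sketch: the function names are hypothetical, and it assumes each line has the form shown above, with the value after the last colon (a value that itself contains a colon would be split incorrectly).

```python
def parse_ompi_info(text):
    """Parse colon-separated lines into {key: value}.

    Assumption: the value is everything after the LAST colon on the line,
    which matches the lines quoted in this post but not values containing ':'.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.rsplit(":", 1)
        fields[key] = value
    return fields


def diff_fields(node_a, node_b):
    """Return {key: (value_on_a, value_on_b)} for every key that differs."""
    keys = sorted(set(node_a) | set(node_b))
    return {k: (node_a.get(k), node_b.get(k))
            for k in keys if node_a.get(k) != node_b.get(k)}
```

An empty result from diff_fields() would suggest that the Open MPI installation itself is the same on both nodes, pointing the search elsewhere.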


LD_LIBRARY_PATH: /usr/lib64/openmpi/lib

OpenFabrics: Stock CentOS RPM

ulimit -l:
'unlimited' on both nodes
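
The value from an interactive shell is not necessarily the limit that a remotely launched MPI process inherits, so it may be worth checking the locked-memory limit from inside a process started the same way as the job. A minimal sketch using Python's standard 'resource' module (the helper names are hypothetical; note that getrlimit reports bytes, while 'ulimit -l' reports kilobytes):

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

Running this script through the same launcher as the failing job (for example via mpirun on 'node b') would show the limit the job actually sees.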

Here's where things get interesting. On all nodes with the QLogic HCA,
'ibv_devinfo' does not print the expected output, only:
        "libibverbs: Warning: no userspace device-specific driver found
          for /sys/class/infiniband_verbs/uverbs0
        No IB devices found"
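
The warning names a sysfs entry, which suggests the kernel side has registered the device even though libibverbs found no matching userspace driver. A minimal sketch for listing what the kernel exposes there, so the result can be compared between 'node a' and 'node b'. It assumes each uverbsN directory contains an 'ibdev' file holding the device name; the function name is hypothetical.

```python
import os


def list_uverbs_devices(root="/sys/class/infiniband_verbs"):
    """Map each uverbs entry under root to the device name in its 'ibdev' file."""
    devices = {}
    if not os.path.isdir(root):
        return devices  # no verbs class directory on this machine
    for entry in sorted(os.listdir(root)):
        ibdev_path = os.path.join(root, entry, "ibdev")
        try:
            with open(ibdev_path) as f:
                devices[entry] = f.read().strip()
        except OSError:
            continue  # entry is not a uverbs device directory
    return devices


if __name__ == "__main__":
    for uverbs, ibdev in list_uverbs_devices().items():
        print(f"{uverbs} -> {ibdev}")
```

If both nodes list the same device here, the difference between them is more likely in the userspace libraries or in the fabric than in the kernel driver.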

But I have successfully run tests on 'node a', such as IMB ping and
hello world, from other working nodes of the cluster, so despite the
output of 'ibv_devinfo', the HCA on 'node a' is working.

I can run 'hello world' from 'node b' to 'node a' without problems,
but the reverse direction does not work.

So this is my question: why is only the HCA on 'node b' not working?
Are there any other tests I can run to get closer to the source of the
problem?