
Open MPI User's Mailing List Archives


Subject: [OMPI users] Debugging Runtime/Ethernet Problems
From: Lloyd Brown (lloyd_brown_at_[hidden])
Date: 2013-09-20 10:49:20

Hi, all.

We've got a couple of clusters running RHEL 6.2, and have several
centrally-installed versions/compilations of OpenMPI. Some of the nodes
have 4xQDR Infiniband, and all the nodes have 1 gigabit ethernet. I was
gathering some bandwidth and latency numbers using the OSU/OMB tests,
and noticed some weird behavior.

When I run a simple "mpirun ./osu_bw" on a couple of IB-enabled nodes, I
get numbers consistent with our IB speed (up to about 3800 MB/s), and
when I run the same thing on two nodes with only Ethernet, I get speeds
consistent with that (up to about 120 MB/s). So far, so good.

The trouble starts when I add some "--mca" parameters to force it to
use TCP/Ethernet: the program seems to hang. I get the headers of the
"osu_bw" output, but no results, even for the first case (1-byte payload
per packet). This happens on both the IB-enabled nodes and the
Ethernet-only nodes. The specific syntax I was using was: "mpirun
--mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
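(For what it's worth, a sketch of an alternative invocation, not a
verified fix: naming the BTLs to include is often safer than only
disabling openib, and overriding btl_tcp_if_exclude replaces Open MPI's
default value, which excludes the loopback interface. The interface
name ib0 is taken from the command above; everything else is an
assumption about this setup.)

```shell
# Force the TCP BTL explicitly instead of only excluding openib.
# Note: setting btl_tcp_if_exclude overrides the built-in default
# (which excludes "lo"), so the loopback interface must be re-added
# to the exclude list or TCP connections may be attempted over it.
mpirun --mca btl tcp,sm,self \
       --mca btl_tcp_if_exclude lo,ib0 \
       ./osu_bw
```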

The problem occurs at least with OpenMPI 1.6.3 compiled with GNU 4.4
compilers, with 1.6.3 compiled with Intel 13.0.1 compilers, and with
1.6.5 compiled with Intel 13.0.1 compilers. I haven't tested any other
combinations yet.

Any ideas? It's very possible this is a system-configuration problem,
but I don't know where to look. At this point, suggestions would be
welcome, either about this specific situation or general pointers on
mpirun debugging flags to use. I can't find much in the docs yet on
run-time debugging for OpenMPI, as opposed to debugging the
application; maybe I'm just looking in the wrong place.
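(A sketch of the runtime-debugging side, under the assumption that the
same osu_bw binary is being launched: Open MPI's MCA verbosity
parameters report which transports and interfaces each process actually
selects, which is usually the first step for a hang like this.)

```shell
# Raise BTL framework verbosity so mpirun reports, on stderr, which
# BTL components are opened and which TCP interfaces are used.
mpirun --mca btl_base_verbose 30 ./osu_bw

# Separately, list the MCA parameters (and their current values)
# recognized by the TCP BTL in this Open MPI installation.
ompi_info --param btl tcp
```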


Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University