We've got a couple of clusters running RHEL 6.2, and have several
centrally-installed versions/compilations of OpenMPI. Some of the nodes
have 4xQDR Infiniband, and all the nodes have 1 gigabit ethernet. I was
gathering some bandwidth and latency numbers using the OSU/OMB tests,
and noticed some weird behavior.
When I run a simple "mpirun ./osu_bw" on a couple of IB-enabled nodes,
I get numbers consistent with our IB speed (up to about 3800 MB/s), and
when I run the same thing on two nodes with only Ethernet, I get speeds
consistent with that (up to about 120 MB/s). So far, so good.
The trouble is that when I add "--mca" parameters to force it to use
TCP/Ethernet, the program seems to hang. I get the headers of the
"osu_bw" output, but no results, not even for the first message size
(1-byte payload per packet). This happens on both the IB-enabled nodes
and the Ethernet-only nodes. The specific syntax I was using was:
"mpirun --mca btl ^openib --mca btl_tcp_if_exclude ib0 ./osu_bw"
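In case it helps anyone reproduce or spot the problem, here is a sketch
of the same run with the BTL framework's verbosity turned up, which
should show which interfaces and endpoints the TCP BTL actually tries.
The verbosity level (30) is an arbitrary choice on my part, and I'm
assuming the same osu_bw binary and interface names as above:

```shell
# Same failing command, plus BTL-level verbosity so the TCP BTL logs
# its interface selection and connection attempts to stderr.
mpirun --mca btl ^openib \
       --mca btl_tcp_if_exclude ib0 \
       --mca btl_base_verbose 30 \
       ./osu_bw

# List the TCP BTL's tunable parameters and their current values,
# e.g. to see what btl_tcp_if_exclude is set to by default.
ompi_info --param btl tcp
```

This won't run outside an MPI environment, of course; it's just the
invocation I'd use to gather more detail.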
The problem occurs at least with OpenMPI 1.6.3 compiled with the GNU
4.4 compilers, with 1.6.3 compiled with the Intel 13.0.1 compilers, and
with 1.6.5 compiled with the Intel 13.0.1 compilers. I haven't tested
any other combinations.
Any ideas here? It's very possible this is a system configuration
problem, but I don't know where to look. At this point, any ideas would
be welcome, either about the specific situation, or general pointers on
mpirun debugging flags to use. I can't find much in the docs yet on
run-time debugging for OpenMPI (as opposed to debugging the
application), but maybe I'm just looking in the wrong place.
Fulton Supercomputing Lab
Brigham Young University