From: Troy Telford (ttelford_at_[hidden])
Date: 2005-10-21 18:03:56

I've been trying out the RC4 builds of OpenMPI; I've been using Myrinet
(gm), Infiniband (mvapi), and TCP.

When running a benchmark such as IMB (formerly PALLAS, IIRC), or even a
simple hello world, there are no problems.

However, when running HPL (or HPCC, which is a superset of HPL), I have
run into a problem: as soon as execution reaches the HPL code, the
process seems to get wedged.

I have no problems compiling and building HPL and HPCC for MPICH variants
(including MVAPICH, MPICH-GM/MX) and LAM; no problems with the gcc,
Intel, PGI, or Pathscale compilers.

The HPL.dat (and hpccinf.txt) can be identical across the machines. The
machines are identically configured (except for the interconnect).

However, when running the HPL code on OpenMPI, HPL pegs the CPUs and
runs until I kill it. If the 'N' size is large enough to use more than a
fraction of a percent of free system memory (about 0.1% of free memory;
the system has 2 GB/CPU in my case), HPL and HPCC will not finish
computing that problem size. Case in point: an N size small enough that
it takes 1-2 seconds with MPICH, MPICH-GM, MVAPICH, or LAM doesn't
complete after several minutes on OpenMPI.
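
For anyone trying to reproduce this, here is a rough sketch of how a
memory fraction maps to an N value. It assumes the usual HPL sizing rule
that the N x N double-precision matrix needs about 8*N^2 bytes in total,
and it treats installed memory as free memory. The helper name and the
example figures are illustrative only; they are not part of HPL.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL stores the matrix in double precision


def max_n_for_fraction(mem_bytes: int, fraction: float) -> int:
    """Largest N whose N x N matrix of doubles fits in fraction * mem_bytes.

    Hypothetical helper; assumes the matrix dominates memory use
    (about 8 * N**2 bytes in total).
    """
    budget = mem_bytes * fraction
    return int(math.sqrt(budget / BYTES_PER_DOUBLE))


GIB = 1024 ** 3
# Assumption: "2 GB" is treated as 2 GiB and all of it is free.
one_cpu = max_n_for_fraction(2 * GIB, 0.001)   # 0.1% of one CPU's memory
one_node = max_n_for_fraction(4 * GIB, 0.001)  # 0.1% of a dual-CPU node
print(one_cpu, one_node)
```

Under those assumptions the threshold works out to an N of roughly 500
for a single CPU and roughly 700 for a dual-CPU node, so the problem
sizes that fail to complete are small ones.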

I'm therefore somewhat confused; I've seen posts from people who claim to
have run HPL with OpenMPI. I've had no issues running other benchmarks on
OpenMPI; but HPL-based code seems to wedge itself... The behavior is
consistent when I use Myrinet, Infiniband, or Ethernet.

I am running OpenMPI on Linux (SuSE Enterprise 9, SP2, x86_64).
Dual-Opteron 248; 2 GB/CPU