Subject: [OMPI users] Performance do not scale at all when run jobs on same single node (Rocks, AMD Barcelona, Torque, Maui, Vasp, Openmpi, Gigabit Ethernet)
From: Steven Truong (midair77_at_[hidden])
Date: 2008-02-26 02:36:38

Dear all. We just finished installing the first batch of nodes with
the following configuration.
Machines: Dual quad-core AMD 2350 + 16 GB of RAM
OS + Apps: Rocks 4.3 + Torque (2.1.8-1) + Maui (3.2.6p19-1) + Openmpi
(1.1.1-8) + VASP
Interconnections: Gigabit Ethernet ports + Extreme Summit x450a

We were able to compile VASP + Openmpi + ACML and ran a bunch of
tests. For all the tests where we ran a _single_ job on ONE
node (1/2/4/8-core jobs), the performance of the VASP jobs scaled well,
as we expected.

The problems surfaced when we tried to run several VASP jobs on the same
node (e.g. two 4-core jobs on 1 node): performance degraded by
about a factor of 2. A sample 4-core VASP test run on a
single node (with no other jobs) takes close to 900 seconds, but
if we run 2 instances of the same job on a single
node, we see around 1700-1800 seconds/job. On the compute
nodes, I used the top command and saw that all 8 threads were running
(~100 %CPU) and the loads were around 8.0 and a little bit up to
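
To put a number on the slowdown, here is a small sketch that only does the arithmetic on the timings quoted above. The function name slowdown is made up for illustration; it is not part of any of the tools mentioned.

```shell
#!/bin/sh
# slowdown SOLO_SECONDS SHARED_SECONDS
# Prints how many times slower a job runs when it shares the node.
# (Hypothetical helper, for illustration only.)
slowdown() {
    awk -v solo="$1" -v shared="$2" 'BEGIN { printf "%.2f\n", shared / solo }'
}

# Timings from the tests above: ~900 s alone, 1700-1800 s when sharing.
slowdown 900 1700   # prints 1.89
slowdown 900 1800   # prints 2.00
```

A factor of 2.00 means that two concurrent 4-core jobs finish no sooner than running them one after the other (2 x 900 s = 1800 s), so the second set of four cores is contributing essentially nothing.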

I thought that processor and/or memory affinity needed to be specified:
 #ompi_info | grep affinity
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.1)
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1.1)

and in my job.txt file for qsub, I modified the mpiexec line to include mpi_paffinity_alone:
mpiexec --mca mpi_paffinity_alone 1 --np $NPROCS vaspmpi_barcelona

However, with or without mpi_paffinity_alone, the performance still
sucks pretty badly and is not acceptable. With mpi_paffinity_alone
set, the performance was worse: with the top command we observed
that some threads were idle a great deal of the time. We also tried
running jobs without qsub and PBS, using mpirun directly on the
nodes, and the performance scaled well, just like running jobs on an
isolated node. Weird?? What in Torque + Maui could cause such problems?
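
One thing that may be worth ruling out is whether the processes of the two jobs end up restricted to the same cores. The sketch below prints the CPU list the kernel reports for a process. It assumes the kernel exposes a Cpus_allowed_list line in /proc/<pid>/status, which older kernels may not; the helper name allowed_cpus is made up for illustration.

```shell
#!/bin/sh
# allowed_cpus [STATUS_FILE]
# Print the CPU list a process is allowed to run on, as reported by the kernel.
# Assumption: /proc/<pid>/status contains a "Cpus_allowed_list:" line.
# If it does not, nothing is printed.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${1:-/proc/self/status}"
}

# Possible use inside job.txt, just before the mpiexec line:
echo "$(hostname) job shell: $(allowed_cpus)"

# Possible use on a compute node while two jobs are running
# (-f matches the full command line, since the binary name is long):
for pid in $(pgrep -f vaspmpi_barcelona); do
    echo "pid $pid: $(allowed_cpus "/proc/$pid/status")"
done
```

If the processes of both 4-core jobs report the same list (for example 0-3), they are competing for the same four cores, which would fit both the factor-of-2 slowdown and the idle threads seen in top. Comparing the output from a qsub run with the output from a direct mpirun run should show whether the batch system changes anything here.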

I am just wondering what I have misconfigured in my cluster: torque?
vasp? maui? openmpi? Apart from the scaling issue, jobs run through
qsub and PBS work great.

My users' .bashrc has these 2 lines:
export LD_LIBRARY_PATH=/opt/acml4.0.1/gfortran64/lib


# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 135168
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

My Torque nodes file has a simple entry like this:

compute-0-0.local np=4
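
Since the machines are dual quad-core (8 cores) and the entry above says np=4, a quick sanity check is to compare the np value with the number of cores the OS sees. This is only a sketch: the helper name node_np is made up, and the nodes-file line is passed in as a string rather than read from the real file.

```shell
#!/bin/sh
# node_np "LINE"
# Print the np= value from one line of a Torque nodes file.
# (Hypothetical helper, for illustration only.)
node_np() {
    printf '%s\n' "$1" | awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^np=[0-9]+$/) { sub(/^np=/, "", $i); print $i }
    }'
}

np=$(node_np "compute-0-0.local np=4")
cores=$(getconf _NPROCESSORS_ONLN)   # cores online on the machine running this
echo "np=$np cores=$cores"
```

Run on a compute node, a mismatch between the two numbers would show that Torque's idea of the node differs from the hardware.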

My Maui's setup is a very simple one.

Please give me your advice and suggestions on how to resolve these
performance issues.

Thank you very much.