Subject: Re: [OMPI users] Performance do not scale at all when run jobs on same single node (Rocks, AMD Barcelona, Torque, Maui, Vasp, Openmpi, Gigabit Ethernet)
From: Steven Truong (midair77_at_[hidden])
Date: 2008-02-26 02:51:57

Correction: the entry in my Torque nodes file is actually
compute-0-0.local np=8 (and not np=4 as in my message below).

Also, when we set mpi_paffinity_alone to 1, all 8 processes were
running, but their %CPU values summed to only about 400%. For some
reason, only half of the node's processing power was being used. The
4 processes of the first job seemed to dominate and took most of that
400% CPU.
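
One way to check whether the two jobs end up pinned to the same cores
is to print the CPU affinity of every rank while both jobs are running.
As far as I understand, mpi_paffinity_alone assumes the job has the
node to itself, so two 4-process jobs could both be bound to the same
four cores, which would match the ~400% total. A minimal sketch,
assuming a Linux node whose kernel exposes Cpus_allowed_list in
/proc/<pid>/status (older kernels may not; "taskset -p <pid>" is an
alternative there). The helper name show_affinity is made up for
illustration.

```shell
#!/bin/sh
# show_affinity: hypothetical helper, not part of Open MPI or Torque.
# Prints the list of cores each given PID is allowed to run on, read from
# the Cpus_allowed_list field of /proc/<pid>/status.
show_affinity() {
    for pid in "$@"; do
        cores=""
        if [ -r "/proc/$pid/status" ]; then
            cores=$(awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$pid/status")
        fi
        # Falls back to "unknown" if the PID is gone or the field is missing.
        echo "pid $pid cores ${cores:-unknown}"
    done
}

# Usage on the compute node while both jobs are running
# (pgrep matches on the truncated process name, hence the short pattern):
#   show_affinity $(pgrep vaspmpi)
```

If the ranks of both jobs report the same core list (for example 0-3),
they are competing for four cores while the other four sit idle.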

Thank you.

On Mon, Feb 25, 2008 at 11:36 PM, Steven Truong <midair77_at_[hidden]> wrote:
> Dear all, we just finished installing the first batch of nodes with
> the following configuration:
> Machines: Dual quad-core AMD Opteron 2350 + 16 GB of RAM
> OS + Apps: Rocks 4.3 + Torque (2.1.8-1) + Maui (3.2.6p19-1) + Open MPI
> (1.1.1-8) + VASP
> Interconnect: Gigabit Ethernet ports + Extreme Summit x450a
> We were able to compile VASP with Open MPI and ACML and ran a number
> of tests. In all the tests where we ran a _single_ job on ONE node
> (1/2/4/8-core jobs), the performance of the VASP jobs scaled well, as
> we expected.
> The problems surfaced when we tried to run several VASP jobs on the
> same node (for example, two 4-core jobs on one node): performance then
> degraded by about a factor of 2. A sample 4-core VASP test run on a
> single node (with no other jobs) takes close to 900 seconds, but when
> we ran 2 instances of the same job on a single node, we saw around
> 1700-1800 seconds per job. On the compute nodes, I used the top
> command and saw that all 8 processes were running (~100 %CPU each) and
> the load was around 8.0, occasionally going up to 8.5.
> I thought that processor and/or memory affinity needed to be specified:
> # ompi_info | grep affinity
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.1)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1.1)
> and I modified my job.txt file for qsub to include mpi_paffinity_alone:
> ....
> mpiexec --mca mpi_paffinity_alone 1 --np $NPROCS vaspmpi_barcelona
> ....
> However, with or without mpi_paffinity_alone, the performance is still
> poor and not acceptable. With mpi_paffinity_alone set, the performance
> was worse: as we observed with the top command, some processes were
> idle much of the time. We also tried running the jobs without qsub and
> PBS, using mpirun directly on the nodes, and the performance scaled
> well, just like running jobs on an isolated node. Weird?? What in
> Torque + Maui could cause such problems?
> I am wondering what I have misconfigured in my cluster: torque?
> vasp? maui? openmpi? Apart from the scaling issue, when jobs run with
> qsub and PBS, things are great.
> My users' .bashrc has these 2 lines:
> export OMP_NUM_THREADS=1
> export LD_LIBRARY_PATH=/opt/acml4.0.1/gfortran64/lib
> and
> # ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 1024
> max locked memory (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files (-n) 4096
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) unlimited
> cpu time (seconds, -t) unlimited
> max user processes (-u) 135168
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
> My Torque nodes file has a simple entry like this:
> compute-0-0.local np=4
> My Maui setup is a very simple one.
> Please give me your advice and suggestions on how to resolve these
> performance issues.
> Thank you very much.
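
Since the same run scales well when started with mpirun directly but
not through qsub, it may be worth comparing what the two launch paths
actually see. A sketch only, under the assumption that a difference in
limits or environment variables is worth ruling out; dump_env and the
output file names are hypothetical, and PBS_JOBID and PBS_NODEFILE are
the variables Torque sets inside a job.

```shell
#!/bin/sh
# dump_env: hypothetical helper that records the settings mentioned in
# this thread so that two launch environments can be compared.
dump_env() {
    echo "host=$(uname -n)"
    echo "PBS_JOBID=${PBS_JOBID:-unset}"
    echo "OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-unset}"
    # Inside a Torque job, the node file lists one line per allocated slot.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo "slots=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')"
    else
        echo "slots=unset"
    fi
    echo "--- ulimit -a ---"
    ulimit -a
}

# Run once in an interactive shell on the compute node and once from a
# job script submitted with qsub, then compare the two files:
#   dump_env > env_interactive.txt
#   dump_env > env_pbs.txt
#   diff env_interactive.txt env_pbs.txt
```

Lines that differ only in PBS_JOBID or slots are expected; a difference
in the ulimit section or in the library path would be a lead to follow.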