Subject: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
From: openmpi_at_[hidden]
Date: 2010-03-30 13:26:52

Hello List,

I hope you can help us with this one; we have been trying to figure it
out for weeks.

The situation: We have a program that can split into several processes,
which are distributed across the nodes of a cluster network using Open MPI.
We were running it on "older" cluster hardware (Intel Core2 Duo
based, 2GB RAM) using an "older" kernel. All nodes boot diskless over
the network. Recently we upgraded the hardware (Intel i5,
8GB RAM), which also required an upgrade to a recent kernel version.

Here is the problem: We see an overall performance loss on the new
hardware, and we think we can narrow it down to a communication issue
between the processes.

We also found that the issue arises in the transition from kernel
2.6.23 to 2.6.24 (tested on the Core2 Duo system).

Here is the output from our program for two runs:

(64bit), MPI 1.2.7
5 iterations (Core2 Duo) 6 CPU:
    93.33 seconds per iteration.
 Node 0 communication/computation time: 6.83 / 647.64 seconds.
 Node 1 communication/computation time: 10.09 / 644.36 seconds.
 Node 2 communication/computation time: 7.27 / 645.03 seconds.
 Node 3 communication/computation time: 165.02 / 485.52 seconds.
 Node 4 communication/computation time: 6.50 / 643.82 seconds.
 Node 5 communication/computation time: 7.80 / 627.63 seconds.
 Computation time: 897.00 seconds.

(64bit) .. re-evaluated, MPI 1.2.7
5 iterations (Core2 Duo) 6 CPU:
   131.33 seconds per iteration.
 Node 0 communication/computation time: 364.15 / 645.24 seconds.
 Node 1 communication/computation time: 362.83 / 645.26 seconds.
 Node 2 communication/computation time: 349.39 / 645.07 seconds.
 Node 3 communication/computation time: 508.34 / 485.53 seconds.
 Node 4 communication/computation time: 349.94 / 643.81 seconds.
 Node 5 communication/computation time: 349.07 / 627.47 seconds.
 Computation time: 1251.00 seconds.
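
A quick way to compare the two runs is to look at how much communication
time each node gained and what share of its total time that represents.
The following is only a small Python sketch that re-uses the numbers
printed above; the names (first_run, second_run, comm_share) are made up
for illustration and are not part of our program.

```python
# (communication, computation) seconds per node, copied from the two
# outputs above. Index in each list = node number.
first_run = [(6.83, 647.64), (10.09, 644.36), (7.27, 645.03),
             (165.02, 485.52), (6.50, 643.82), (7.80, 627.63)]
second_run = [(364.15, 645.24), (362.83, 645.26), (349.39, 645.07),
              (508.34, 485.53), (349.94, 643.81), (349.07, 627.47)]


def comm_share(comm, comp):
    """Fraction of a node's total time spent in communication/waiting."""
    return comm / (comm + comp)


# Additional communication time per node in the second run.
extra_comm = [round(b[0] - a[0], 2) for a, b in zip(first_run, second_run)]

# Change in pure computation time per node (should be close to zero if
# only communication got slower).
comp_delta = [round(b[1] - a[1], 2) for a, b in zip(first_run, second_run)]

for node, (a, b) in enumerate(zip(first_run, second_run)):
    print(f"Node {node}: comm share {100 * comm_share(*a):5.1f}% -> "
          f"{100 * comm_share(*b):5.1f}%, extra comm {extra_comm[node]:7.2f} s")
```

With these numbers the extra communication time is between roughly 341
and 357 seconds on every node, while the computation times differ by
less than 3 seconds. That is close to the 354 second difference between
the two reported totals (1251.00 vs. 897.00 seconds).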

The program is 32-bit software, but it makes no difference whether the
kernel is 64 or 32 bit. We also tested Open MPI 1.4.1, which cut
communication times by half (still too high), but the improvement
shrank with increasing kernel version number.

The communication time is the time the master process spends
distributing the data portions for calculation and collecting the results
from the slave processes. The value also includes the time a slave has to
wait to communicate with the master while the master is busy. This
explains the longer communication time of node #3, whose calculation
time is shorter (due to the nature of the data).

The command to start the calculation:
mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : \
       -np 4 -host cluster-18,cluster-19

Using top (with 'f' and 'j' to show the P column, i.e. the last used
CPU) we could track which process runs on which core. We found that each
process stayed on its initial core with kernel 2.6.23, but started to
move between cores with 2.6.24. Using the --bind-to-core option in
Open MPI 1.4.1 kept the processes on their cores again, but that did not
change the overall outcome and did not fix the issue.
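
The same information can be logged without watching top interactively.
A minimal sketch in Python, assuming a Linux /proc filesystem: according
to proc(5), /proc/<pid>/stat holds the process state in field 3 and the
CPU the process last ran on in field 39. The helper names parse_stat and
watch are hypothetical and only serve as illustration.

```python
import time


def parse_stat(line):
    """Return (pid, command, state, last_cpu) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split at the last ')'.
    head, _, tail = line.rpartition(")")
    pid_text, _, command = head.partition("(")
    fields = tail.split()
    state = fields[0]            # field 3: e.g. R = running, D = uninterruptible sleep
    last_cpu = int(fields[36])   # field 39: CPU the process last executed on
    return int(pid_text), command, state, last_cpu


def watch(pids, samples=10, interval=1.0):
    """Print pid, command, state and last-used CPU of each PID once per interval."""
    for _ in range(samples):
        for pid in pids:
            with open(f"/proc/{pid}/stat") as handle:
                print(*parse_stat(handle.read()))
        time.sleep(interval)
```

Calling watch() with the PIDs of the MPI processes on a node prints one
line per process and sample, so core changes and state changes can be
compared between kernel versions from a log file.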

We found top showing ~25% CPU wait time and processes in state 'D'
(uninterruptible sleep), also on slave-only nodes. According to our
programmer, communication takes place only between the master process
and its slaves, not among slaves. On kernel 2.6.23 and lower, CPU usage
is 100% user, with no wait or system time.

Example from top:

Cpu(s): 75.3%us, 0.6%sy, 0.0%ni, 0.0%id, 23.1%wa, 0.7%hi, 0.3%si,
Mem: 8181236k total, 131224k used, 8050012k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 49868k cached

 3386 oli 20 0 90512 20m 3988 R 74 0.3 12:31.80 0 invert-
 3387 oli 20 0 85072 15m 3780 D 67 0.2 11:59.30 1 invert-
 3388 oli 20 0 85064 14m 3588 D 77 0.2 12:56.90 2 invert-
 3389 oli 20 0 84936 14m 3436 R 85 0.2 13:28.30 3 invert-

Some system information that might be helpful:

Nodes Hardware:
1. "older": Intel Core2 Duo, (2x1)GB RAM
2. "newer": Intel(R) Core(TM) i5 CPU, Mainboard ASUS RS100-E6, (4x2)GB RAM

Debian stable (lenny) distribution with
ii libc6 2.7-18lenny2
ii libopenmpi1 1.2.7~rc2-2
ii openmpi-bin 1.2.7~rc2-2
ii openmpi-common 1.2.7~rc2-2

Nodes are booting diskless with nfs-root and a kernel with all drivers
needed compiled in.

Information on the program using openmpi and tools used to compile it:

mpirun --version:
mpirun (Open MPI) 1.2.7rc2

libopenmpi-dev 1.2.7~rc2-2
depends on:
 libc6 (2.7-18lenny2)
 libopenmpi1 (1.2.7~rc2-2)
 openmpi-common (1.2.7~rc2-2)

Compilation command:

FORTRAN compiler (FC):
gfortran --version:
GNU Fortran (Debian 4.3.2-1.1) 4.3.2

Called OpenMPI-functions (FORTRAN Bindings):





Additionally linked libncurses library:
libncurses5-dev (5.7+20081213-1)
On remote nodes no calls are ever made to this library. On local nodes
such calls (coded in C) are optional, and usually they are skipped as
well (i.e. not even initscr() is called).

A signal handler (coded in C) is integrated that responds specifically
to SIGTERM and SIGUSR1 signals.

If you need more information (e.g. kernel config etc.) please ask.
I hope you can provide some ideas to test and resolve the issue.
Thanks anyways.


