I have a very dim recollection of some kernel TCP issues back in some older kernel versions -- such issues affected all TCP communications, not just MPI. Can you try a newer kernel, perchance?
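If you want to check whether plain TCP is affected independently of MPI, a rough sketch like the one below could time small round trips over a socket. The function names are my own, and the loopback demo only exercises the code. To test the cluster itself you would run the echo side on one node and point measure_round_trips at it, with a host and port of your choosing.

```python
import socket
import threading
import time


def echo_server(listener):
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def measure_round_trips(host, port, count=100, payload=b"x" * 64):
    """Return a list of round-trip times (seconds) for small TCP messages."""
    times = []
    with socket.create_connection((host, port)) as sock:
        # Disable Nagle so each small message is sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < len(payload):
                chunk = sock.recv(len(payload) - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection early")
                received += len(chunk)
            times.append(time.perf_counter() - start)
    return times


def run_loopback_demo(count=100):
    """Start a local echo server on a free port and time round trips to it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    thread = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    thread.start()
    times = measure_round_trips("127.0.0.1", port, count)
    thread.join(timeout=5)
    listener.close()
    return times


if __name__ == "__main__":
    samples = run_loopback_demo()
    print("median round trip: %.1f us" % (sorted(samples)[len(samples) // 2] * 1e6))
```

Comparing the same measurement between two nodes under 2.6.23 and 2.6.24 might show whether the slowdown is visible without MPI in the picture.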
On Mar 30, 2010, at 1:26 PM, <openmpi_at_[hidden]> <openmpi_at_[hidden]> wrote:
> Hello List,
> I hope you can help us with this one; we have been trying to figure it
> out for weeks.
> The situation: We have a program that can split into several
> processes distributed across nodes within a cluster network using Open MPI.
> We were running that system on "older" cluster hardware (Intel Core2 Duo
> based, 2GB RAM) using an "older" kernel (126.96.36.199). All nodes are
> diskless network booting. Recently we upgraded the hardware (Intel i5,
> 8GB RAM) which also required an upgrade to a recent kernel version.
> Here is the problem: We experience an overall performance loss on the new
> hardware and think we can narrow it down to a communication issue
> between the processes.
> We also found that the issue arises in the transition from kernel
> 2.6.23 to 2.6.24 (tested on the Core2 Duo system).
> Here is the output from our program:
> 188.8.131.52 (64bit), MPI 1.2.7
> 5 iterations (Core2 Duo) 6 CPU:
> 93.33 seconds per iteration.
> Node 0 communication/computation time: 6.83 / 647.64 seconds.
> Node 1 communication/computation time: 10.09 / 644.36 seconds.
> Node 2 communication/computation time: 7.27 / 645.03 seconds.
> Node 3 communication/computation time: 165.02 / 485.52 seconds.
> Node 4 communication/computation time: 6.50 / 643.82 seconds.
> Node 5 communication/computation time: 7.80 / 627.63 seconds.
> Computation time: 897.00 seconds.
> 184.108.40.206 (64bit) .. re-evaluated, MPI 1.2.7
> 5 iterations (Core2 Duo) 6 CPU:
> 131.33 seconds per iteration.
> Node 0 communication/computation time: 364.15 / 645.24 seconds.
> Node 1 communication/computation time: 362.83 / 645.26 seconds.
> Node 2 communication/computation time: 349.39 / 645.07 seconds.
> Node 3 communication/computation time: 508.34 / 485.53 seconds.
> Node 4 communication/computation time: 349.94 / 643.81 seconds.
> Node 5 communication/computation time: 349.07 / 627.47 seconds.
> Computation time: 1251.00 seconds.
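To make the two listings above easier to compare, here is a small sketch that computes, per node, the share of time spent communicating and the extra communication time in the second run. The numbers are copied from the listings. The names RUN_A and RUN_B and the helper functions are mine, chosen only for illustration.

```python
# (communication, computation) seconds per node, copied from the two listings.
RUN_A = [(6.83, 647.64), (10.09, 644.36), (7.27, 645.03),
         (165.02, 485.52), (6.50, 643.82), (7.80, 627.63)]
RUN_B = [(364.15, 645.24), (362.83, 645.26), (349.39, 645.07),
         (508.34, 485.53), (349.94, 643.81), (349.07, 627.47)]

# Seconds per iteration reported for each run.
SECONDS_PER_ITERATION_A = 93.33
SECONDS_PER_ITERATION_B = 131.33


def comm_share(comm, comp):
    """Fraction of a node's total time that was spent communicating."""
    return comm / (comm + comp)


def added_comm_seconds(before, after):
    """Per-node increase in communication time between two runs."""
    return [round(a[0] - b[0], 2) for b, a in zip(before, after)]


def slowdown(old, new):
    """Ratio of the new per-iteration time to the old one."""
    return new / old


if __name__ == "__main__":
    extra = added_comm_seconds(RUN_A, RUN_B)
    for node, ((ca, pa), (cb, pb)) in enumerate(zip(RUN_A, RUN_B)):
        print("node %d: comm share %.1f%% -> %.1f%%, extra comm %.2f s"
              % (node, 100 * comm_share(ca, pa), 100 * comm_share(cb, pb), extra[node]))
    print("per-iteration slowdown: %.2fx"
          % slowdown(SECONDS_PER_ITERATION_A, SECONDS_PER_ITERATION_B))
```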
> The program is 32-bit software, but it makes no difference
> whether the kernel is 64-bit or 32-bit. We also tested Open MPI version 1.4.1,
> which cut communication times by half (still too high), but the
> improvement decreased with increasing kernel version number.
> The communication time is the time the master process spends
> distributing the data portions for calculation and collecting the results
> from the slave processes. The value also includes the time a slave has to
> wait to communicate with the master while the master is busy. This explains the
> extended communication time of node #3, whose calculation time is
> reduced (owing to the nature of the data).
> The command to start the calculation:
> mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np
> 4 -host cluster-18,cluster-19
> Using top (with 'f' and 'j' to show the P column) we could track which process
> runs on which core. We found that processes stayed on their initial cores in
> kernel 2.6.23, but started to move between cores with 2.6.24. Using the
> --bind-to-core option in Open MPI 1.4.1 kept the processes on their cores
> again, but that did not change the overall outcome or fix the issue.
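The per-process core that top shows can also be read programmatically, which may be handy for logging it over a whole run. Below is a minimal sketch, assuming Linux with /proc mounted. The helper names are mine. It reads the "processor" field (field 39) of /proc/<pid>/stat, the CPU the process last ran on, and the set of cores the process is allowed to use.

```python
import os


def last_cpu(pid="self"):
    """Return the CPU number a process last ran on, from /proc/<pid>/stat."""
    with open("/proc/%s/stat" % pid) as f:
        data = f.read()
    # The command name sits in parentheses and may contain spaces, so split
    # after the closing parenthesis. What follows starts at field 3 (state).
    fields = data.rsplit(")", 1)[1].split()
    return int(fields[36])  # field 39 overall ("processor")


def allowed_cpus(pid=0):
    """Return the set of CPUs the process may run on (0 = calling process)."""
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    print("last ran on CPU", last_cpu())
    print("allowed CPUs:", sorted(allowed_cpus()))
```

Sampling last_cpu for each invert-master PID once a second would show how often the processes migrate, and allowed_cpus should shrink to a single core when binding is in effect.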
> We found top showing ~25% CPU wait time and processes in state 'D',
> also on slave-only nodes. According to our programmer, communication takes place
> only between the master process and its slaves, not among slaves. On
> kernel 2.6.23 and lower, CPU usage is 100% user, with no wait or system time.
> Example from top:
> Cpu(s): 75.3%us, 0.6%sy, 0.0%ni, 0.0%id, 23.1%wa, 0.7%hi, 0.3%si,
> Mem: 8181236k total, 131224k used, 8050012k free, 0k buffers
> Swap: 0k total, 0k used, 0k free, 49868k cached
>  PID USER PR NI  VIRT RES  SHR S %CPU %MEM    TIME+ P COMMAND
> 3386 oli  20  0 90512 20m 3988 R   74  0.3 12:31.80 0 invert-
> 3387 oli  20  0 85072 15m 3780 D   67  0.2 11:59.30 1 invert-
> 3388 oli  20  0 85064 14m 3588 D   77  0.2 12:56.90 2 invert-
> 3389 oli  20  0 84936 14m 3436 R   85  0.2 13:28.30 3 invert-
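If you end up collecting many such top snapshots, a small parser could pull out the wait percentage and the processes in state 'D'. This is only a sketch: the function names are mine, and it assumes the column order shown in the example above (state in the 8th column, last-used CPU in the 12th).

```python
import re

# Sample lines copied from the top output above.
CPU_LINE = "Cpu(s): 75.3%us, 0.6%sy, 0.0%ni, 0.0%id, 23.1%wa, 0.7%hi, 0.3%si,"
PROCESS_ROWS = [
    "3386 oli 20 0 90512 20m 3988 R 74 0.3 12:31.80 0 invert-",
    "3387 oli 20 0 85072 15m 3780 D 67 0.2 11:59.30 1 invert-",
    "3388 oli 20 0 85064 14m 3588 D 77 0.2 12:56.90 2 invert-",
    "3389 oli 20 0 84936 14m 3436 R 85 0.2 13:28.30 3 invert-",
]


def parse_cpu_line(line):
    """Map each label on a top 'Cpu(s):' line to its percentage."""
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


def parse_process_row(row):
    """Extract PID, state and last-used CPU from one top process row."""
    fields = row.split()
    return {"pid": int(fields[0]), "state": fields[7], "cpu": int(fields[11])}


def pids_in_state(rows, state):
    """Return the PIDs of all rows whose state column equals `state`."""
    return [p["pid"] for p in map(parse_process_row, rows) if p["state"] == state]


if __name__ == "__main__":
    cpu = parse_cpu_line(CPU_LINE)
    print("user %.1f%%, wait %.1f%%" % (cpu["us"], cpu["wa"]))
    print("PIDs in state D:", pids_in_state(PROCESS_ROWS, "D"))
```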
> Some system information that might be helpful:
> Nodes Hardware:
> 1. "older": Intel Core2 Duo, (2x1)GB RAM
> 2. "newer": Intel(R) Core(TM) i5 CPU, Mainboard ASUS RS100-E6, (4x2)GB RAM
> Debian stable (lenny) distribution with
> ii libc6 2.7-18lenny2
> ii libopenmpi1 1.2.7~rc2-2
> ii openmpi-bin 1.2.7~rc2-2
> ii openmpi-common 1.2.7~rc2-2
> Nodes are booting diskless with nfs-root and a kernel with all drivers
> needed compiled in.
> Information on the program using openmpi and tools used to compile it:
> mpirun --version:
> mpirun (Open MPI) 1.2.7rc2
> libopenmpi-dev 1.2.7~rc2-2
> depends on:
> libc6 (2.7-18lenny2)
> libopenmpi1 (1.2.7~rc2-2)
> openmpi-common (1.2.7~rc2-2)
> Compilation command:
> FORTRAN compiler (FC):
> gfortran --version:
> GNU Fortran (Debian 4.3.2-1.1) 4.3.2
> Called OpenMPI-functions (FORTRAN Bindings):
> Additionally linked libncurses library:
> libncurses5-dev (5.7+20081213-1)
> On remote nodes no calls are ever made to this library. On local nodes
> such calls (coded in C) are optional, and usually they are
> skipped too (i.e. not even initscr() is called).
> A signal handler (coded in C) is integrated that reacts specifically to
> SIGTERM and SIGUSR1 signals.
> If you need more information (e.g. kernel config etc.) please ask.
> I hope you can provide some ideas to test and resolve the issue.
> Thanks anyways.