
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-03-31 09:38:10


I have a very dim recollection of some kernel TCP issues back in some older kernel versions -- such issues affected all TCP communications, not just MPI. Can you try a newer kernel, perchance?

On Mar 30, 2010, at 1:26 PM, <openmpi_at_[hidden]> wrote:

> Hello List,
>
> I hope you can help us out on this one, as we have been trying to
> figure it out for weeks.
>
> The situation: We have a program capable of splitting into several
> processes that are distributed across nodes in a cluster network using
> Open MPI. We were running that system on "older" cluster hardware
> (Intel Core2 Duo based, 2GB RAM) with an "older" kernel (2.6.18.6).
> All nodes boot diskless over the network. Recently we upgraded the
> hardware (Intel i5, 8GB RAM), which also required an upgrade to a
> recent kernel version (2.6.26+).
>
> Here is the problem: We experience an overall performance loss on the
> new hardware and think we can narrow it down to a communication issue
> between the processes.
>
> We also found that the issue arises in the transition from kernel
> 2.6.23 to 2.6.24 (tested on the Core2 Duo system).
>
> Here is some output from our program:
>
> 2.6.23.17 (64bit), MPI 1.2.7
> 5 iterations (Core2 Duo), 6 CPUs:
> 93.33 seconds per iteration.
> Node 0 communication/computation time: 6.83 / 647.64 seconds.
> Node 1 communication/computation time: 10.09 / 644.36 seconds.
> Node 2 communication/computation time: 7.27 / 645.03 seconds.
> Node 3 communication/computation time: 165.02 / 485.52 seconds.
> Node 4 communication/computation time: 6.50 / 643.82 seconds.
> Node 5 communication/computation time: 7.80 / 627.63 seconds.
> Computation time: 897.00 seconds.
>
> 2.6.24.7 (64bit), re-evaluated, MPI 1.2.7
> 5 iterations (Core2 Duo), 6 CPUs:
> 131.33 seconds per iteration.
> Node 0 communication/computation time: 364.15 / 645.24 seconds.
> Node 1 communication/computation time: 362.83 / 645.26 seconds.
> Node 2 communication/computation time: 349.39 / 645.07 seconds.
> Node 3 communication/computation time: 508.34 / 485.53 seconds.
> Node 4 communication/computation time: 349.94 / 643.81 seconds.
> Node 5 communication/computation time: 349.07 / 627.47 seconds.
> Computation time: 1251.00 seconds.
>
> The program is 32-bit software, but it makes no difference whether the
> kernel is 64-bit or 32-bit. We also tested Open MPI 1.4.1, which cut
> the communication times in half (still far too high), but the
> improvement decreased with increasing kernel version.
>
> The communication time is meant to be the time the master process
> spends distributing the data portions for calculation and collecting
> the results from the slave processes. The value also includes the time
> a slave has to wait to communicate with the master while the master is
> busy. This explains the longer communication time of node #3, whose
> computation time is lower (due to the nature of its data).
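>
> For illustration, a stripped-down sketch of that timing pattern could
> look like the following (made-up names, not the actual invert code;
> communication and computation are timed separately with MPI_Wtime):
>
>   program timing_sketch
>     implicit none
>     include 'mpif.h'
>     integer :: ierr, rank, nprocs, i
>     double precision :: t0, t_comm, t_comp, partial, total
>     double precision :: work(1000)
>
>     call mpi_init(ierr)
>     call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
>     call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
>
>     t_comm = 0.0d0
>     t_comp = 0.0d0
>
>     ! master fills the data, everyone receives it; any time a slave
>     ! spends waiting for the master here counts as communication
>     if (rank == 0) work = 1.0d0
>     t0 = MPI_Wtime()
>     call mpi_bcast(work, 1000, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
>     t_comm = t_comm + (MPI_Wtime() - t0)
>
>     ! local computation on this rank's share of the data
>     t0 = MPI_Wtime()
>     partial = 0.0d0
>     do i = 1 + rank, 1000, nprocs
>        partial = partial + work(i)
>     end do
>     t_comp = t_comp + (MPI_Wtime() - t0)
>
>     ! collect the results on the master
>     t0 = MPI_Wtime()
>     call mpi_reduce(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
>     t_comm = t_comm + (MPI_Wtime() - t0)
>
>     print '(A,I3,A,F8.2,A,F8.2,A)', 'Node ', rank, &
>           ' communication/computation time: ', t_comm, ' / ', t_comp, ' seconds.'
>
>     call mpi_finalize(ierr)
>   end program timing_sketch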
>
> The command to start the calculation:
> mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np
> 4 -host cluster-18,cluster-19
>
> Using top (with 'f' and 'j' to show the P column) we could track which
> process runs on which core. Processes stayed on their initial cores
> with kernel 2.6.23 but started to move around with 2.6.24. Using the
> --bind-to-core option of Open MPI 1.4.1 kept the processes on their
> cores again, but that did not change the overall outcome and did not
> fix the issue.
>
> We also found top showing ~25% CPU wait time and processes in state
> 'D', even on slave-only nodes. According to our programmer,
> communication takes place only between the master process and its
> slaves, not among the slaves. On kernel 2.6.23 and lower, CPU usage is
> 100% user time, with no wait or system percentage.
>
> Example from top:
>
> Cpu(s): 75.3%us, 0.6%sy, 0.0%ni, 0.0%id, 23.1%wa, 0.7%hi, 0.3%si,
> 0.0%st
> Mem: 8181236k total, 131224k used, 8050012k free, 0k buffers
> Swap: 0k total, 0k used, 0k free, 49868k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 3386 oli 20 0 90512 20m 3988 R 74 0.3 12:31.80 0 invert-
> 3387 oli 20 0 85072 15m 3780 D 67 0.2 11:59.30 1 invert-
> 3388 oli 20 0 85064 14m 3588 D 77 0.2 12:56.90 2 invert-
> 3389 oli 20 0 84936 14m 3436 R 85 0.2 13:28.30 3 invert-
>
>
> Some system information that might be helpful:
>
> Nodes Hardware:
> 1. "older": Intel Core2 Duo, (2x1)GB RAM
> 2. "newer": Intel(R) Core(TM) i5 CPU, Mainboard ASUS RS100-E6, (4x2)GB RAM
>
> Debian stable (lenny) distribution with
> ii libc6 2.7-18lenny2
> ii libopenmpi1 1.2.7~rc2-2
> ii openmpi-bin 1.2.7~rc2-2
> ii openmpi-common 1.2.7~rc2-2
>
> Nodes boot diskless with an NFS root and a kernel with all needed
> drivers compiled in.
>
> Information on the program using Open MPI and the tools used to compile it:
>
> mpirun --version:
> mpirun (Open MPI) 1.2.7rc2
>
> libopenmpi-dev 1.2.7~rc2-2
> depends on:
> libc6 (2.7-18lenny2)
> libopenmpi1 (1.2.7~rc2-2)
> openmpi-common (1.2.7~rc2-2)
>
>
> Compilation command:
> mpif90
>
>
> FORTRAN compiler (FC):
> gfortran --version:
> GNU Fortran (Debian 4.3.2-1.1) 4.3.2
>
>
> Called Open MPI functions (Fortran bindings); a short sketch of how
> they are typically combined follows the list:
> mpi_comm_rank
> mpi_comm_size
>
> mpi_bcast
> mpi_reduce
>
> mpi_isend
> mpi_wait
>
> mpi_send
> mpi_probe
> mpi_recv
>
> MPI_Wtime
>
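> As a rough illustration of how these calls are combined (again just a
> simplified sketch with invented names, not the real code): the master
> posts a non-blocking mpi_isend and waits on it with mpi_wait, while a
> slave uses mpi_probe before mpi_recv:
>
>   program sendrecv_sketch
>     implicit none
>     include 'mpif.h'
>     integer :: ierr, rank, req
>     integer :: status(MPI_STATUS_SIZE)
>     double precision :: chunk(100)
>
>     call mpi_init(ierr)
>     call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
>
>     if (rank == 0) then
>        ! master: non-blocking send of a data portion, then wait for it
>        chunk = 1.0d0
>        call mpi_isend(chunk, 100, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, req, ierr)
>        call mpi_wait(req, status, ierr)
>     else if (rank == 1) then
>        ! slave: probe for the incoming message, then receive it
>        call mpi_probe(0, 99, MPI_COMM_WORLD, status, ierr)
>        call mpi_recv(chunk, 100, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, status, ierr)
>     end if
>
>     call mpi_finalize(ierr)
>   end program sendrecv_sketch
>
> (Run with at least two ranks, e.g. "mpirun -np 2 ./sendrecv_sketch".)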
>
> Additionally linked libncurses library:
> libncurses5-dev (5.7+20081213-1)
> On remote nodes no calls are ever made to this library. On local nodes
> such calls (coded in C) are only optional, and usually they are skipped
> too (i.e. not even initscr() is called).
>
>
> A signal handler is integrated (coded in C) that reacts specifically to
> SIGTERM and SIGUSR1 signals.
>
>
> If you need more information (e.g. the kernel config), please ask.
> I hope you can provide some ideas to test and resolve the issue.
> Thanks anyway.
>
> Oli
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/