It sounds like you are running into an issue with the Linux scheduler. I have an item on my to-do list to add a "bind-this-thread-to-<core, socket, ...>" API, but that won't be available until sometime in the future.
A couple of things you could try in the meantime. First, use the -cpus-per-rank option to separate the ranks from each other. In other words, instead of -bind-to-socket -bysocket, you would use:
-bind-to-core -cpus-per-rank N
This will take each rank and bind it to a unique set of N cores, thereby cleanly separating them on the node.
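For instance, on your 8-core nodes (2 sockets x 4 cores), four ranks per node with two cores each would look like the following; -report-bindings makes mpirun print where each rank actually landed, so you can verify the separation (the application name here is just a placeholder):

```shell
# Hypothetical layout: 8-core nodes, 4 ranks per node, 2 cores per rank.
# -report-bindings prints each rank's actual core assignment at launch.
mpirun -np 8 -npernode 4 -bind-to-core -cpus-per-rank 2 \
       -report-bindings ./my_app
```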
Second, the Linux scheduler tends to become jealous of the way MPI procs "hog" the resources. The scheduler needs room to run all those daemons and other processes too. So it tends to squeeze you aside a little, just to create some room for the rest of the stuff.
What you can do is "entice" it away from your processes by leaving 1-2 cores for its own use. For example:
-npernode 2 -bind-to-core -cpus-per-rank 3
would run two MPI ranks on each node, each rank exclusively bound to 3 cores. This leaves 2 cores on each node for Linux. When the scheduler sees the 6 cores of your MPI/MP procs working hard, and 2 cores sitting idle, it will tend to use those 2 cores for everything else - and not be tempted to push you aside to gain access to "your" cores.
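One quick way to confirm the split, assuming a Linux /proc filesystem: launch a trivial command under the same binding options, so that every process prints the cores it is confined to.

```shell
# Each launched process reports its allowed core list; with the options
# above you should see two disjoint 3-core sets per node, and the two
# leftover cores absent from every rank's list.
mpirun -npernode 2 -bind-to-core -cpus-per-rank 3 \
       grep Cpus_allowed_list /proc/self/status
```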
On Feb 29, 2012, at 3:08 AM, Auclair Francis wrote:
> Dear Open-MPI users,
> Our code currently runs Open-MPI (1.5.4) with SLURM on a NUMA machine
> (2 sockets per node and 4 cores per socket), with basically two
> levels of implementation for Open-MPI:
> - at the lower level, n "Master" MPI processes (one per socket) are
> run simultaneously, classically dividing the physical domain into n subdomains,
> - while at the higher level, 4n MPI processes are spawned to run a sparse Poisson solver.
> At each time step, the code thus goes back and forth between these two levels of implementation using two MPI communicators. This also means that, during about half of the computation time, 3n cores are at best sleeping (if not 'waiting' at a barrier) when not inside the solver routines. We consequently decided to add OpenMP functionality to our code for the phases when the solver is not running (we declare a single "parallel" region and use the omp "master" directive when the OpenMP threads are not active). We however face several difficulties:
> a) It seems that both the 3n MPI processes and the OpenMP threads 'consume processor cycles while waiting'. We consequently tried: mpirun
> -mpi_yield_when_idle 1, export OMP_WAIT_POLICY=passive, or export
> KMP_BLOCKTIME=0 ... The last of these finally leads to an interesting reduction
> in computing time, but worsens the second problem we have to face (see b).
> b) We managed to obtain a "correct" (?) placement of our MPI processes
> on our sockets by using: mpirun -bind-to-socket -bysocket -np 4n
> However, while the OpenMP threads initially seem to scatter over each socket (one
> thread per core), they slowly migrate to the same core as their 'Master' MPI process, or gather on one or two cores per socket.
> We played around with the environment variable KMP_AFFINITY, but the best we could obtain was a pinning of the OpenMP threads to their own cores... disrupting at the same time the placement of the 4n Level-2 MPI processes. In addition, neither specifying a rankfile nor the mpirun option -x IPATH_NO_CPUAFFINITY=1 seems to change the situation significantly.
> This behavior looks rather inefficient, but so far we have not managed to prevent the migration of the 4 threads to at most a couple of cores!
> Is there something wrong in our "Hybrid" implementation?
> Do you have any advice?
> Thanks for your help,