Open MPI User's Mailing List Archives

Subject: [OMPI users] Job distribution on many-core NUMA system
From: A. Austen (metallurgist_at_[hidden])
Date: 2009-08-28 12:39:52

Hello all.

I apologize if this has been addressed in the FAQ or on the mailing
list, but I spent a fair amount of time searching both and found no
direct answers.

I use Open MPI, currently version 1.3.2, on an 8-way quad-core AMD
Opteron machine, so 32 cores in total. The computer runs a modern 2.6
family Linux kernel. I don't currently use a resource manager like
SLURM, since there is at most one other user and we don't step on each
other's toes.

What I find is that when I launch MPI jobs, I don't see the processes
packed optimally onto the cores. I think OMPI should try to place jobs
in such a way that the tasks fill up all four cores of one socket, then
as many cores as necessary on the next socket, and so on.

So for example, if I want to run 6 jobs, each of which needs 4
processors, I can see that as I start the jobs up, the processes for
each job get distributed without regard to NUMA optimality -- 2 of them
might be on processor A, 1 on processor B, and the fourth on processor
C. Since I have dynamic clocking enabled, I can check this by looking
at /proc/cpuinfo (see what the clock speeds are on each core when the
system is otherwise quiescent), or by using top and turning on the
display for each processor.
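
For reference, the check described above can be sketched in Python. This
is only a sketch, assuming an x86 Linux /proc/cpuinfo that reports
"processor" and "physical id" fields, with "physical id" identifying the
socket. The function names are made up for illustration.

```python
from collections import defaultdict


def cpus_by_socket(cpuinfo_text):
    """Group logical CPU numbers by socket, using the 'physical id' field.

    cpuinfo_text is the content of /proc/cpuinfo, where each logical CPU
    is described by a block of 'key : value' lines ending in a blank line.
    """
    sockets = defaultdict(list)
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields and "physical id" in fields:
            socket = int(fields["physical id"])
            sockets[socket].append(int(fields["processor"]))
    return {socket: sorted(cpus) for socket, cpus in sockets.items()}


def last_cpu_of(pid):
    """Return the logical CPU a process last ran on.

    This is field 39 ('processor') of /proc/<pid>/stat. The command name
    in field 2 may contain spaces, so split after its closing parenthesis.
    """
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    rest = stat.rsplit(")", 1)[1].split()  # rest[0] is field 3
    return int(rest[36])                   # field 39 -> index 39 - 3
```

Calling cpus_by_socket(open("/proc/cpuinfo").read()) gives the
socket-to-core map, and last_cpu_of() on each rank's PID shows whether
the ranks of one job are sharing a socket or are spread across several.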

Obviously, in terms of maximizing performance, this is bad. Once I
start getting up to, say, 5 of the 4-processor jobs, I can see
computational throughput degrade heavily. I would hypothesize there is
heavy contention on the HyperTransport links.

I saw the processor and memory affinity options, but those seem to
address a different problem -- namely, keeping the jobs pinned to
specific resources. I also want that, but it's not the same issue as
the one I described above.

So, I guess I have several questions:

1. Is there any way to have Open MPI automatically tell Linux via its
affinity and NUMA-related APIs that the OMPI jobs should be scheduled in
such a way that they fill the cores on particular sockets, and try to
use adjacent sockets?

2. I think the rankfile may be the way for me to address this issue, but
do I need to have a different rankfile for each job? The FAQ shows the
ability to wildcard the "core" number/ID field. Is there a way to
wildcard the socket field but not the core field, that is, to tell OMPI
"I don't care which socket you choose, but the job should always be
mapped onto the cores of a single socket"? That might not make sense
for a job using more than the number of cores per socket, but it would
be useful for jobs that fit within one. For a job needing, say, more
than 4 processes on a quad-core, it probably makes sense to
specifically tell OMPI which sockets to use as well, to try to maintain
the smallest number of processor hops.

3. If my understanding is correct, and a rankfile will help me solve
this problem, can I safely turn on processor and memory affinity such
that the different OMPI jobs I manually launched will not vie for
affinity on the same processor cores/memory chunks?
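
To make questions 2 and 3 concrete, here is a rough sketch of the
one-rankfile-per-job approach in Python. It assumes the
"rank N=host slot=socket:core" rankfile syntax from the FAQ. The host
name "node0" and the function names are placeholders, and the 8-socket,
4-core layout is the machine described above. It also assumes socket
numbers start at 0 and that each job fits on one socket.

```python
def socket_rankfile(host, socket, nprocs, cores_per_socket=4):
    """Return rankfile text that puts every rank of one job on one socket.

    Rank r is mapped to core r of the given socket, so a job never
    spans sockets. Raises ValueError if the job does not fit.
    """
    if nprocs > cores_per_socket:
        raise ValueError("job needs more cores than one socket provides")
    lines = [f"rank {r}={host} slot={socket}:{r}" for r in range(nprocs)]
    return "\n".join(lines) + "\n"


def rankfiles_for_jobs(host, njobs, nprocs, sockets=8, cores_per_socket=4):
    """Return {job index: rankfile text}, giving each job its own socket.

    Job j gets socket j, so manually launched jobs never share cores.
    """
    if njobs > sockets:
        raise ValueError("more jobs than sockets")
    return {
        job: socket_rankfile(host, job, nprocs, cores_per_socket)
        for job in range(njobs)
    }


def write_rankfiles(files, prefix="job"):
    """Write each rankfile to '<prefix><job>.rank' and return the names."""
    names = []
    for job, text in sorted(files.items()):
        name = f"{prefix}{job}.rank"
        with open(name, "w") as f:
            f.write(text)
        names.append(name)
    return names
```

Each job would then be started with its own file, along the lines of
"mpirun -np 4 -rf job0.rank ./solver", where ./solver stands in for the
real executable. Whether processor and memory affinity can safely be
enabled on top of this is exactly what question 3 asks, so the sketch
does not settle it.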

Thank you.

-- - Same, same, but different...