Big topic and actually the subject of much recent discussion. Here are
a few comments:
1) "Optimally" depends on what you're doing. A big issue is making
sure each MPI process gets as much memory bandwidth (and cache and other
shared resources) as possible. This would argue that processes
*should* be spread over as many sockets as possible. And, indeed, some
MPIs default to this behavior. It depends on lots of things, including
how much of the machine you're using.
2) Currently (1.3.2), there is rankfile support. This is probably a
little bit more gruesome than you hope for. E.g., if you have multiple
jobs, you need to custom tailor the rankfile for each. Another heavy
hammer might be to write scripts that, depending on the job, the process
rank, and so on, launch each MPI process under numactl. I'm not convinced
you want to go that route, but at some level it offers you the ability
to do what you're asking for.
3) Soon (1.3.4?, or use the trunk), there should be some richer support
including bind-to-socket, bind-to-core, etc. I happen to like
bind-to-socket. Sounds like you like bind-to-core. Ralph's putbacks
should make each of us happy. But if multiple jobs are being launched,
you might still find that the functionality doesn't go far enough.
4) The default behavior of the OS may depend on the OS, the BIOS (which
numbers the cores), etc.
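
To make the two "heavy hammer" options in item 2 a bit more concrete, here is a rough sketch in Python. It is not anything Open MPI ships. The function names, the host name, and the one-job-per-socket policy are made up for illustration. It assumes the rankfile line format shown in the FAQ ("rank N=host slot=socket:core") and that the NUMA node numbers numactl sees match the socket numbers, which may not hold on every BIOS (see item 4).

```python
#!/usr/bin/env python3
"""Sketch: keep each job on one socket via a rankfile or a numactl wrapper."""

CORES_PER_SOCKET = 4  # quad-core Opterons, as in the original question


def make_rankfile(host, socket, nranks, cores_per_socket=CORES_PER_SOCKET):
    """Return rankfile text that places ranks 0..nranks-1 on one socket.

    Assumes the FAQ's line format: rank N=host slot=socket:core
    """
    if nranks > cores_per_socket:
        raise ValueError("job does not fit on a single socket")
    lines = ["rank %d=%s slot=%d:%d" % (r, host, socket, r)
             for r in range(nranks)]
    return "\n".join(lines) + "\n"


def numactl_command(socket, program_argv):
    """Return an argv that runs program_argv bound to one NUMA node.

    Assumes NUMA node numbers match socket numbers on this machine.
    """
    node = str(socket)
    return (["numactl", "--cpunodebind=" + node, "--membind=" + node]
            + list(program_argv))


if __name__ == "__main__":
    # Hypothetical policy: the third of several 4-process jobs gets socket 2.
    print(make_rankfile("localhost", socket=2, nranks=4), end="")
    print(" ".join(numactl_command(2, ["./a.out"])))
```

Each job would get its own socket number, which is exactly the per-job tailoring complained about above. The host name in the rankfile has to match whatever mpirun thinks the node is called, so "localhost" is only a placeholder.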
Caveat: this note is hastily written with fuzzy knowledge of the status
of all the subissues. Just a quick message to start what I think will
in any case be a long e-mail thread.
A. Austen wrote:
>I apologize if this has been addressed in the FAQ or on the mailing
>list, but I spent a fair amount of time searching both and found no
>I use OpenMPI, currently version 1.3.2, on an 8-way quad-core AMD
>Opteron machine. So 32 cores in total. The computer runs a modern 2.6
>family Linux kernel. I don't at the present time use a resource manager
>like SLURM, since there is at most one other user and we don't step on
>each other's toes.
>What I find is that when I launch MPI jobs, I don't see the processes
>packed optimally onto the cores. I think OMPI should try to place jobs
>in such a way that the tasks fill up all four cores of one socket, then
>as many cores as necessary on the next socket, and so on.
>So for example, if I want to run 6 tasks, each of which needs 4
>processors, I can see that as I start the jobs up, the processes for
>each job get distributed without regard to NUMA optimality -- 2 of them
>might be on processor A, 1 on processor B, and the fourth on processor
>C. Since I have dynamic clocking enabled, I can check this by looking
>at /proc/cpuinfo (see what the clock speeds are on each core when the
>system is otherwise quiescent), or by using top and turning on the
>display for each processor.
>Obviously, in terms of maximizing performance, this is bad. Once I
>start getting up to say 5 of the 4-processor jobs, I can see
>computational throughput degrade heavily. I would hypothesize there is
>heavy contention on the HyperTransport links.
>I saw the processor and memory affinity options, but that seems to
>address a different problem -- namely, keep the jobs pinned to specific
>resources. I also want that, but it's not the same issue as I discussed
>So, I guess I have several questions:
>1. Is there any way to have OpenMPI automatically tell Linux via its
>affinity and NUMA-related APIs that the OMPI jobs should be scheduled in
>such a way that they fill the cores on particular sockets, and try to
>use adjacent sockets?
>2. I think the rankfile may be the way for me to address this issue, but
>do I need to have a different rankfile for each job? The FAQ shows the
>ability to wildcard the "core" number/ID field. Is there a way to
>wildcard the socket field, but not the core field, that is tell OMPI I
>don't care what socket you choose, but the job should always be mapped
>onto the cores of a single socket? The latter might not make sense for
>a job using more than the number of cores per socket, but it would be
>useful in that case. On a job needing say more than 4 processes on a
>quad-core, it probably makes sense to specifically tell OMPI which
>sockets to use as well, to try to maintain the smallest number of
>3. If my understanding is correct, and a rankfile will help me solve
>this problem, can I safely turn on processor and memory affinity such
>that the different OMPI jobs I manually launched will not vie for
>affinity on the same processor cores/memory chunks?