Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi_paffinity_alone and Nehalem SMT
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-10-23 12:27:51


Noam Bernstein wrote:

> Hi all - we have a new Nehalem cluster (dual quad core), and SMT is
> enabled in the BIOS (for now). I do want to do benchmarking on our
> applications, obviously, but I was also wondering what happens if I just
> set the number of slots to 8 in SGE, and just let things run. In
> particular,
> how will things be laid out if I do "mpirun --mca mpi_paffinity_alone
> 1"?

0, 1, 2, 3, 4, 5, etc. As usual.

> 1. Will it be clever enough to schedule each process on its own core,
> and only resort to the second SMT virtual core if I go over 8
> processes per node (dual quad core)?

No. "Clever" is not part of mpi_paffinity_alone semantics. The
semantics are 0, 1, 2, 3, etc. What that means with respect to cores,
sockets, hardware threads, etc., depends on how your BIOS numbers these
things. It could be "good". It could be "bad" (e.g., doubly
subscribing a core before moving on to the next one).

> 2. If it's not that clever, can I pass a rank file?

Yes.

> 3. If I do have to do that, what is the mapping between core numbers
> and processor/core/SMT virtual cores?

Depends on your BIOS, I think. Take a look at /proc/cpuinfo. Here is
one example:

$ grep "physical id" /proc/cpuinfo
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
$ grep "core id" /proc/cpuinfo
core id : 0
core id : 0
core id : 1
core id : 1
core id : 2
core id : 2
core id : 3
core id : 3
core id : 0
core id : 0
core id : 1
core id : 1
core id : 2
core id : 2
core id : 3
core id : 3

In this case, sequential binding takes you round-robin between the
sockets (physical id), filling up the cores on each socket. Only after
the first 8 processes do you revisit a core (i.e., land on its second
hardware thread). So, that's a "good" numbering.
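If it helps, one way to see each logical CPU next to its socket and
core in a single listing (just a quick sketch; exact field names can
vary a bit between kernels) is:

$ grep -E "^(processor|physical id|core id)" /proc/cpuinfo
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 1
core id : 0
...

On the box above, that makes it easy to read off which logical CPU
numbers end up sharing a core.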

Starting in OMPI 1.3.4, there is "improved" binding support, but it's
not aware of hardware threads. If you're okay using only one thread per
core, that may be all you need: you could run with "mpirun -bysocket
-bind-to-socket". If you need to use more than one thread per core,
however, that won't do the job for you. You'd have to use rankfiles or
something similar.
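
For what it's worth, here is roughly what a rankfile could look like.
This is only a sketch: "node01" is a made-up hostname, "a.out" a
placeholder executable, and you should double-check the slot syntax
against the rankfile section of the mpirun(1) man page for your exact
release. The socket:core form below pins one rank per core, round-robin
across the two sockets:

$ cat myrankfile
rank 0=node01 slot=0:0
rank 1=node01 slot=1:0
rank 2=node01 slot=0:1
rank 3=node01 slot=1:1
rank 4=node01 slot=0:2
rank 5=node01 slot=1:2
rank 6=node01 slot=0:3
rank 7=node01 slot=1:3
$ mpirun -np 8 -rf myrankfile ./a.out

If you really do want to place ranks on individual hardware threads,
the man page also describes slot forms that name specific processor
IDs, if I remember right; which IDs correspond to which threads comes
back to the /proc/cpuinfo numbering above.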