On Mar 24, 2008, at 6:29 AM, Mark Kosmowski wrote:
> I have a successful ompi installation and my software runs across my
> humble cluster of three dual-Opteron (single core) nodes on OpenSUSE
> 10.2. I'm planning to upgrade some RAM soon and have been thinking of
> playing with affinity, since each CPU will have its own DIMMs after
> the upgrade. I have read the FAQ and know to use "--mca
> mpi_paffinity_alone 1" to enable affinity.
> It looks like I am running ompi 1.1.4 (see below).
> mark_at_LT:~> ompi_info | grep affinity
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.4)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component
> Does this old version of ompi do a good job of implementing affinity
> or would it behoove me to use the current version if I am interested
> in trying affinity?
It's the same level of affinity support as in the 1.2 series.
There are a few affinity upgrades in development; some will land in
the v1.3 series and some will come later:
- upgrade to a newer embedded version of PLPA; this probably won't
affect you much (will be in v1.3)
- allow assigning MPI processes to specific socket/core combinations
via a file specification (will be in v1.3)
- better launch support, so that resource managers that implement
their own affinity controls (e.g., SLURM) can directly set the
affinity for MPI processes (some future version; probably won't be
ready for v1.3).
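For context, here is a rough sketch of what processor affinity boils down to on Linux: each process on a node gets bound to one processor so the scheduler stops migrating it. This is not Open MPI's implementation (that goes through PLPA). The function names and the round-robin choice by local rank are assumptions made for illustration, and `os.sched_setaffinity` is only available in Python on Linux.

```python
import os


def cpu_for_local_rank(local_rank, allowed_cpus):
    """Pick a CPU for a process from its rank on the node (round-robin).

    Hypothetical helper: the mapping policy is an assumption, not
    necessarily the one Open MPI applies.
    """
    cpus = sorted(allowed_cpus)
    if not cpus:
        raise ValueError("no CPUs available")
    return cpus[local_rank % len(cpus)]


def pin_current_process(local_rank):
    """Bind the calling process to a single CPU.

    Returns the chosen CPU id, or None where Python exposes no
    affinity API (anything other than Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs this process may run on now
    cpu = cpu_for_local_rank(local_rank, allowed)
    os.sched_setaffinity(0, {cpu})      # restrict this process to that CPU
    return cpu
```

On a dual-processor, single-core node, local ranks 0 and 1 would end up on CPUs 0 and 1 under this policy, assuming the kernel numbers them that way.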
> What sorts of time gains do people typically see with affinity? (I'm
> a chemistry student running plane-wave solid-state calculation software
> if this helps with the question)
As with everything, it depends. :-)
- If you're just running one MPI process per core and you only have
one core per socket, you might simply see a "smoothing" of results --
meaning that multiple runs of the same job will have slightly more
consistent timing results (e.g., less "jitter" in the timings)
- If you have a NUMA architecture (e.g., AMD) and have multiple NICs,
you can play games to get the MPI processes that are actually doing the
communicating to be "close" to the NIC in the internal host topology.
If your app is using a lot of short messages over low-latency
interconnects, this can make a difference. If you're using TCP, it
likely won't make much of a difference. :-)
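One way to check for the "smoothing" effect is to time the same job several times with and without affinity and compare the spread rather than just the mean. The sketch below is only a suggestion; `timing_summary` is a hypothetical helper, and the wall-clock times have to come from your own runs.

```python
import statistics


def timing_summary(wall_times):
    """Summarize repeated run times (in seconds).

    Returns (mean, sample standard deviation, coefficient of variation).
    A lower coefficient of variation means less run-to-run jitter.
    """
    if len(wall_times) < 2:
        raise ValueError("need at least two runs to estimate jitter")
    mean = statistics.mean(wall_times)
    stdev = statistics.stdev(wall_times)
    return mean, stdev, stdev / mean
```

Call it once on the timings taken with `--mca mpi_paffinity_alone 1` and once on those taken without. If affinity is helping in the way described above, the coefficient of variation should drop even when the mean barely moves.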
> Lastly, two of the three machines will have all of their DIMM slots
> populated by equal sized DIMMs. However, one of my machines has two
> processors, each of which has four DIMM slots. This machine will
> be getting 4 @ 1 GB DIMMs and 2 @ 2 GB DIMMs. I am assuming that the
> best thing for affinity would be to put all of the 1 GB DIMMs on one
> processor and the 2 GB DIMMs on the other, and to put the 2 GB DIMMs in
> slots 0 and 1. Does it matter which processor gets which set of
It depends on what your application is doing. You generally want to
have enough "local" RAM for the [MPI] processes that will be running
on each socket.
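As a quick sanity check on "enough local RAM", the arithmetic can be sketched as below, taking the DIMM sizes as gigabytes. With the split proposed in the question, both sockets end up with the same 4 GB of local memory. The per-process memory figure is something you would have to measure for your own application; the function names are hypothetical.

```python
def local_memory_gb(dimms_gb):
    """Total memory attached to one socket, given its DIMM sizes in GB."""
    return sum(dimms_gb)


def fits_locally(dimms_gb, procs_on_socket, gb_per_proc):
    """True if the processes on a socket fit in that socket's own memory.

    gb_per_proc is an assumed, application-specific number; OS and
    buffer-cache overhead are ignored here.
    """
    return procs_on_socket * gb_per_proc <= local_memory_gb(dimms_gb)


# Layout from the question: four 1 GB DIMMs on one processor,
# two 2 GB DIMMs on the other.
socket_a = [1, 1, 1, 1]
socket_b = [2, 2]
```

Since `local_memory_gb(socket_a)` and `local_memory_gb(socket_b)` both come to 4, capacity alone does not favor either processor for either set; with one process per single-core socket, what matters is whether that process's working set stays under its socket's 4 GB.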