FWIW, especially on NUMA machines (like AMDs), physical access to
network resources (such as NICs / HCAs) can be much faster from some
sockets than from others.
For example, we recently ran some microbenchmarks with 2 MPI processes
on 2 NUMA machines (a simple ping-pong benchmark across the two
machines): if you pin each MPI process to socket 0 / core 0, you get
noticeably better latency. If you don't, the MPI process may not stay
physically close to the NIC/HCA, resulting in more "jitter" in the
delivered latency (or, even worse, consistently worse latency).
I *believe* that this has to do with the physical layout of the
machine (i.e., the NIC/HCA bus is physically "closer" to some
sockets), but I'm not enough of a hardware guy to know for sure.
Someone with more specific knowledge should chime in here...
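To make the pinning concrete, here is a minimal sketch using Linux's
process-affinity API (Python's `os.sched_setaffinity`). The choice of
core 0 is an assumption taken from the benchmark above; the right core
for your NIC/HCA depends on the machine's physical layout.

```python
import os

# Hedged sketch: pin the current process to core 0 so it stays on the
# socket assumed to be closest to the NIC/HCA. Linux-only API; core
# numbering and socket topology vary by machine.
os.sched_setaffinity(0, {0})    # pid 0 = this process; {0} = CPU core 0

# Confirm the kernel applied the affinity mask.
print(os.sched_getaffinity(0))  # -> {0}
```

An MPI launcher would normally do this for you (or you can wrap the
process with `taskset`/`numactl` at launch time); the point is only
that the process never migrates away from the chosen core.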
On Dec 1, 2006, at 2:13 PM, Greg Lindahl wrote:
> On Fri, Dec 01, 2006 at 11:51:24AM +0100, Peter Kjellstrom wrote:
>> This might be a bit naive but, if you spawn two procs on a dual
>> core dual socket system then the linux kernel should automagically
>> schedule them this way.
> No, we checked this for OpenMP and MPI, and in both cases wiring the
> processes to cores was significantly better. The Linux scheduler
> (still) tends to migrate processes to the wrong core when OS threads
> and processes wake up and go back to sleep.
> Just like the OpenMPI guys, we don't have a clever solution for the
> "what if the user wants to have 2 OpenMP or MPI jobs share the same
> node?" problem. Well, I have a plan, but it's annoying to implement.
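One hedged sketch of the shared-node case (not Greg's plan, which he
doesn't describe): give each job a disjoint set of cores up front, so
pinned processes from the two jobs never contend for the same core.
The core IDs below are illustrative assumptions; real layouts come
from the machine's topology (e.g., /proc/cpuinfo).

```python
# Assumed 4-core, dual-socket layout: cores 0-1 on socket 0,
# cores 2-3 on socket 1. Each job pins its ranks within its own set.
job_a_cores = {0, 1}   # job A gets socket 0
job_b_cores = {2, 3}   # job B gets socket 1

# The partition must be disjoint for the scheme to work at all.
assert job_a_cores.isdisjoint(job_b_cores)
print(sorted(job_a_cores | job_b_cores))  # -> [0, 1, 2, 3]
```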
> -- greg