Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Job distribution on many-core NUMA system
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-09-01 18:23:01

A. Austen wrote:

> On Fri, 28 Aug 2009 10:16 -0700, "Eugene Loh" <Eugene.Loh@Sun.COM> wrote:
>
>> Big topic and actually the subject of much recent discussion.  Here are
>> a few comments:
>>
>> 1)  "Optimally" depends on what you're doing.  A big issue is making
>> sure each MPI process gets as much memory bandwidth (and cache and other
>> shared resources) as possible.   This would argue that processes
>> *should* be spread over as many sockets as possible.  And, indeed, some
>> MPIs default to this behavior.  It depends on lots of things, including
>> how much of the machine you're using.
>
> Yes, you're right.  In my case, my processes within a single MPI job are
> tightly coupled.  These jobs are communication-intensive, and if I want
> to use as many of the processors as possible, then minimizing the
> cross-processor communication should yield the best overall throughput.
> However, I see your point completely -- for an embarrassingly parallel
> problem, spreading the processes amongst the different sockets/memory
> pools would probably give the best performance.

The problem doesn't even need to be embarrassingly parallel.  Many MPI applications depend on computational performance, which is often sensitive to memory bandwidth.  This factor can be more important to application performance than interprocess communications.

>> 2)  Currently (1.3.2), there is rankfile support.  This is probably a
>> little bit more gruesome than you hope for.  E.g., if you have multiple
>> jobs, you need to custom tailor the rankfile for each.
>
> So then it would seem like at least for now, I can get the behavior I
> want by using rankfiles?

Yes.  Or, pick up the latest/greatest changes in the trunk (bind-by-core, etc.), but there still is no multi-job awareness.

> Also, if I use the rankfile to distribute the processes, how about the
> affinity issue?  Can I still use affinity and expect that it will apply
> to the topology specified in the rankfile, or will all the MPI jobs
> always try to bind to the same processors in sequence?

If you use rankfiles, each MPI job will try to bind per the rankfile specified for it.  So, if you're willing to construct a different rankfile for each job, you'll be set with rankfiles.
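
As a rough illustration of "a different rankfile for each job", here is a minimal Python sketch that generates rankfile text for two jobs sharing one node, giving each job its own pair of sockets so the bindings do not overlap. The host name `node01`, the 4-socket/4-cores-per-socket layout, and the helper `make_rankfile` are all made up for the example; the `rank N=host slot=socket:core` line format follows the rankfile syntax described in the mpirun man page, so check it against the man page for your Open MPI version.

```python
def make_rankfile(host, sockets, cores_per_socket, nranks):
    """Return rankfile text that places nranks ranks on the given sockets.

    Hypothetical helper: sockets are filled one after another, so
    consecutive (tightly coupled) ranks land on the same socket.
    """
    # Every (socket, core) pair this job is allowed to use, in fill order.
    slots = [(s, c) for s in sockets for c in range(cores_per_socket)]
    if nranks > len(slots):
        raise ValueError("more ranks than cores in the requested sockets")
    lines = [
        f"rank {rank}={host} slot={sock}:{core}"
        for rank, (sock, core) in enumerate(slots[:nranks])
    ]
    return "\n".join(lines) + "\n"


# Assumed layout: one node with 4 sockets of 4 cores each.
# Job A gets sockets 0-1, job B gets sockets 2-3.
job_a = make_rankfile("node01", sockets=[0, 1], cores_per_socket=4, nranks=8)
job_b = make_rankfile("node01", sockets=[2, 3], cores_per_socket=4, nranks=8)

print(job_a)
print(job_b)
```

Each string would be written to its own file and passed to the corresponding job, e.g. something like `mpirun -np 8 -rf job_a.rf ./app` (the file and program names are placeholders). Since there is no multi-job awareness, keeping the socket lists disjoint between jobs is entirely up to whoever generates the files.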