On Jul 14, 2011, at 5:46 PM, Jeff Squyres wrote:
> Looping in the users mailing list so that Ralph and Oracle can comment...
Not entirely sure what I can contribute here, but I'll try - see below for some clarifications. I think the discussion here is based on some misunderstanding of how OMPI works.
> On Jul 14, 2011, at 2:34 PM, Rayson Ho wrote:
>> (CC'ing Jeff from the Open-MPI project...)
>> On Thu, Jul 14, 2011 at 1:35 PM, Tad Kollar <tad.kollar_at_[hidden]> wrote:
>>> As I thought more about it, I was afraid that might be the case, but hoped
>>> sge_shepherd would do some magic for tightly-integrated jobs.
>> To SGE, if each of the tasks is not started by sge_shepherd, then the
>> only option is to set the binding mask to the allocation, which in
>> your original case, was the whole system (48 CPUs).
>>> We're running OpenMPI 1.5.3 if that makes a difference. Do you know of
>>> anyone using an MVAPICH2 1.6 pe that can handle binding?
OMPI uses its own binding scheme - we stay within the overall binding envelope given to us, but we don't use external bindings of individual procs. The reason is simple: SGE has no visibility into the MPI procs we spawn. All SGE sees is mpirun and the daemons (called orteds) that we launch on each node, so it can't provide a binding scheme for the MPI procs. In fact, it has no idea how many procs are on each node, because OMPI's mapper supports a multitude of algorithms, all invisible to SGE.
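To illustrate why the per-node proc count can't be known from the allocation alone, here is a minimal sketch of two mapping policies applied to the same allocation. This is an illustration only and not OMPI's mapper code; the host names, slot counts, and function names are hypothetical.

```python
from collections import Counter

def map_by_slot(allocation, nprocs):
    """Fill every slot on a host before moving on to the next host."""
    placement = []  # list index is the rank, value is the host
    for host, slots in allocation:
        for _ in range(slots):
            if len(placement) == nprocs:
                return placement
            placement.append(host)
    return placement

def map_by_node(allocation, nprocs):
    """Deal ranks round-robin across hosts, skipping hosts that are full."""
    remaining = {host: slots for host, slots in allocation}
    placement = []
    while len(placement) < nprocs and any(remaining.values()):
        for host, _ in allocation:
            if len(placement) == nprocs:
                break
            if remaining[host] > 0:
                placement.append(host)
                remaining[host] -= 1
    return placement

# Hypothetical allocation: two hosts with four slots each
allocation = [("node0", 4), ("node1", 4)]
by_slot = Counter(map_by_slot(allocation, 4))
by_node = Counter(map_by_node(allocation, 4))
print(dict(by_slot))
print(dict(by_node))
```

With the same allocation and the same number of procs, the first policy puts all four procs on one host while the second spreads them two and two, so the scheduler cannot infer the layout from what it granted.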
>> I just downloaded Open MPI 1.5.4a and grep'ed the source, looks like
>> it is not looking at the SGE_BINDING env variable that is set by SGE.
No, we don't. However, the orteds do check to see if they have been bound, and if so, to what processors. Those bindings are then used as an envelope limiting the processors we use to bind the procs we spawn.
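The envelope idea can be sketched roughly as follows: read the set of CPUs the launching process is already restricted to, then pick CPUs for the local procs only from that set. This is a sketch under assumptions, not the orted logic; `current_envelope` and `bind_within` are hypothetical names, `os.sched_getaffinity` is only available on Linux, and the code only computes a plan without applying any binding.

```python
import os

def current_envelope():
    """Return the set of CPU ids this process is allowed to run on."""
    getaff = getattr(os, "sched_getaffinity", None)
    if getaff is None:
        # Assumption: on platforms without this call, treat all CPUs as allowed
        return set(range(os.cpu_count() or 1))
    return set(getaff(0))

def bind_within(envelope, nprocs):
    """Assign one CPU per local proc, cycling through the envelope only."""
    cpus = sorted(envelope)
    if not cpus:
        raise ValueError("empty binding envelope")
    return {rank: cpus[rank % len(cpus)] for rank in range(nprocs)}

envelope = current_envelope()
plan = bind_within(envelope, 4)  # local rank -> CPU id inside the envelope
print(plan)
```

If the scheduler bound the daemon to CPUs 8-11, every local proc lands on one of those four CPUs; if the daemon was not bound at all, the envelope is simply the whole node.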
>>> The serial case worked (its affinity list was '0' instead of '0-47'), so at
>>> least we know that's in good shape :-)
>> Please also submit a few more jobs and see if the new hwloc code is
>> able to handle multiple jobs running on your AMD MC server.
>>> My ultimate goal is for affinity support to be enabled and scheduled
>>> automatically for all MPI users, i.e. without them having to do any more
>>> than they would for a no-affinity job (otherwise I have a feeling most of
>>> them would just ignore it). What do you think it will take to get to that
We tried to do this once - I set a default param to auto-bind processes. Major error. I was lynched by the MPI user community until we removed that param.
Here's why: suppose you have MPI processes that launch threads. Remember, there is no thread-level binding out there - all the OS will let you do is bind at the process level. So now you bind someone's MPI process to some core(s), which forces all the threads from that process to stay within that binding, thereby potentially creating a horrendous thread-contention problem.
It doesn't take threading to cause problems - some applications just don't work as well when bound. It's true that the benchmarks generally do, but they aren't representative of real applications.
Bottom line: defaulting to binding processes was something the MPI community appears to have rejected, with reason. Might argue about whether or not they are correct, but that appears to be the consensus, and it is the position OMPI has adopted. User ignorance of when to bind and when not to bind is not a valid reason to impact everyone.
>> That's my goal since 2008...
>> I started a mail thread, "processor affinity -- OpenMPI / batchsystem
>> integration" to the Open MPI list in 2008. And in 2009, the conclusion
>> was that Sun was saying that the binding info is set in the
>> environment and Open MPI would perform the binding itself (so I
>> assumed that was done):
It is done - we just use OMPI's binding schemes and not the ones provided natively by SGE. Like I said above, SGE doesn't see the MPI procs and can't provide a binding pattern for them - so looking at the SUNW_MP_BIND envar is pointless.
>> Revisiting the presentation (see: job2core.pdf link at the above URL),
>> Sun's variable name is $SUNW_MP_BIND, so it is most likely Sun Cluster
>> Toolkit implementation specific rather than a feature in Open MPI --
>> and looking at the Open MPI code I don't see SUNW_MP_BIND referenced
>> I believe it is a matter of integrating the thread binding support
>> between the 2 -- both SGE & Open MPI support thread binding.
I don't believe this is accurate - certainly OMPI doesn't support thread-level binding, and I haven't seen an OS that does yet. Might happen someday...but I suspect you mean "process" and not "thread".
>> The harder part is to handle cross node binding as SGE binds threads
>> locally only (not directly controlled by qmaster) -- maybe a call to
>> "qstat -cb -j <job id>" would do the trick, and the info is parsed and
>> passed to mpirun via the "--rankfile" option.
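For reference, the rankfile half of that suggestion might look something like the sketch below. It assumes the per-host core lists have already been extracted from the scheduler (the output format of the qstat call is not shown in this thread, so no parsing is attempted), and it assumes the scheduler's core numbers line up with the numbering the rankfile expects. The host names, core ids, and `make_rankfile` are hypothetical.

```python
def make_rankfile(host_cores):
    """Build rankfile text: one rank per granted core, hosts in given order."""
    lines = []
    rank = 0
    for host, cores in host_cores:
        for core in cores:
            # One line per rank: "rank <N>=<host> slot=<core>"
            lines.append(f"rank {rank}={host} slot={core}")
            rank += 1
    return "\n".join(lines) + "\n"

# Hypothetical per-host core grants, as if already pulled from the scheduler
host_cores = [("node0", [0, 1]), ("node1", [4, 5])]
rankfile_text = make_rankfile(host_cores)
print(rankfile_text, end="")
```

The resulting text would be written to a file and handed to mpirun via the "--rankfile" option mentioned above.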
> Jeff Squyres