I am not a grid engine expert by any means, but I do know a bit about OMPI's internals for binding processes.
Here is what we do:
1. mpirun gets its list of hosts from the environment, or from your machine file. It then maps the processes across the machines.
2. mpirun launches a daemon on each node that will host MPI processes. Under SGE, this launch is done via qrsh with -inherit set.
3. Each daemon "senses" the local binding constraint by querying the OS for the list of processors available to it on that node.
4. Each daemon spawns its local MPI processes, directly telling the OS to bind each process to one of the available processors. Processors are selected round-robin by relative MPI rank, so you should never get two processes bound to the same processor when enough processors are available. If you do, that is an OMPI bug.
So SGE is responsible for setting up the global binding (i.e., telling each SGE node how many processors we are allowed to use on that node), and then OMPI uses that info to set the binding on the individual procs via the local OS.
The key thing to understand here is that SGE has zero visibility or knowledge of the individual MPI procs. All SGE ever sees is mpirun and its daemons.
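To make the division of labor concrete: all SGE hands mpirun is a pe_hostfile describing how many slots each node contributes. Something like the following (hostnames hypothetical; the last column is UNDEFINED unless a -binding request fills it in):

```
node01.cluster 4 all.q@node01.cluster UNDEFINED
node02.cluster 4 all.q@node02.cluster UNDEFINED
```

mpirun maps ranks from that file, and the per-node binding of individual procs happens later, entirely inside the daemons.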
On Nov 13, 2010, at 7:39 AM, Chris Jewell wrote:
> Hi Dave, Reuti,
> Sorry for kicking off this thread, and then disappearing. I've been away for a bit. Anyway, Dave, I'm glad you experienced the same issue as I had with my installation of SGE 6.2u5 and OpenMPI with core binding -- namely that with 'qsub -pe openmpi 8 -binding set linear:1 <myscript.com>', if two or more of the parallel processes get scheduled to the same execution node, then the processes end up being bound to the same core. Not good!
> I've been playing around quite a bit trying to understand this issue, and ended up on the GE dev list:
> It seems that most people expect that calls to 'qrsh -inherit' (which I assume OpenMPI uses to bind parallel processes to reserved GE slots) activate a separate binding. This does not appear to be the case. I *was* hoping that using -binding pe linear:1 might enable me to write a script that read the pe_hostfile and created a machine file for OpenMPI, but this fails as GE does not appear to give information as to which cores are unbound, only the number required.
> So, for now, my solution has been to use a JSV to remove core binding for the MPI jobs (but retain it for serial and SMP jobs). Any more ideas??
> (PS. Dave: how is my alma mater these days??)
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> CV4 7AL
> Tel: +44 (0)24 7615 0778