On Jan 24, 2007, at 7:03 AM, Pak Lui wrote:
> Geoff Galitz wrote:
>> On the following system:
>> OpenMPI 1.1.1
>> SGE 6.0 (with tight integration)
>> Scientific Linux 4.3
>> Dual Dual-Core Opterons
>> MPI jobs are oversubscribing to the nodes. No matter where jobs
>> are launched by the scheduler, they always stack up on the first
>> node (node00) and continue to stack even though the system load
>> exceeds 6 (on a 4-processor box). Each node is defined as 4
>> slots with 4 max slots. The MPI jobs launch via "mpirun -np
>> (some-number-of-processors)" from within the scheduler.
> Hi Geoff,
> I think we first started having SGE support in v1.2, not in 1.1.1.
> Unless you did some modification on your own to include the
> gridengine ras/pls modules from v1.2, you probably are not using
> the SGE tight integration. So even though you start mpirun in the
> SGE parallel environment, ORTE does not have the gridengine modules
> for allocating and launching the jobs, so that could be why all
> processes are launched on the same node (there's no node list
> available from gridengine, so it defaults to a single node).
I have used the backport instructions provided by Olli-Pekka Lehto.
Whether it is running properly in my case, I can't say, as I am
certainly not getting the expected behavior, although the jobs do run.
> On a related note, there is a way for SGE to allocate and assign
> slots for launching tasks. It is done by setting the allocation
> rule in the parallel environment (PE). If all of the slots are
> allocated on the same node, it sounds like the allocation rule has
> been set to $fill_up. Maybe you can try with $round_robin instead?
If I use $round_robin, one MPI process starts up per node and then
wraps around the cluster. So if I have a 4-process MPI job, it starts
1 process on each of 4 nodes, which is certainly not the most
efficient method.
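
To make the difference between the two allocation rules concrete, here is a minimal Python sketch of how slots would be handed out under each. It is only an illustration: the function name, the host names other than node00, and the 4-slots-per-host figure are assumptions for the example, and the real SGE scheduler also weighs load and queue ordering, which this ignores.

```python
def allocate(rule, hosts, slots_per_host, nprocs):
    """Toy model of an SGE PE allocation rule (hypothetical helper).

    Returns a dict mapping each host to the number of slots it is given.
    """
    if nprocs > len(hosts) * slots_per_host:
        raise ValueError("not enough slots for the request")

    counts = {host: 0 for host in hosts}
    remaining = nprocs

    if rule == "fill_up":
        # Fill every slot on one host before moving to the next.
        for host in hosts:
            take = min(slots_per_host, remaining)
            counts[host] = take
            remaining -= take
    elif rule == "round_robin":
        # Hand out one slot per host, wrapping around until done.
        while remaining > 0:
            for host in hosts:
                if remaining > 0 and counts[host] < slots_per_host:
                    counts[host] += 1
                    remaining -= 1
    else:
        raise ValueError("unknown rule: " + rule)

    return counts


hosts = ["node00", "node01", "node02", "node03"]
print(allocate("fill_up", hosts, 4, 4))
print(allocate("round_robin", hosts, 4, 4))
```

With four 4-slot hosts and a 4-process job, the fill_up case puts all four slots on node00 and the round_robin case puts one slot on each host, which matches the behavior described above. SGE's allocation_rule can also be set to a fixed integer (slots per host), which may be a middle ground worth trying.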
>> It seems to me that MPI is not detecting that the nodes are
>> overloaded, and that this is due to the way the job slots are
>> defined and how mpirun is being called. If I read the documentation
>> correctly, a single mpirun run consumes one job slot no matter
>> the number of processes which are launched. We can change the
>> number of job slots, but then we expect to waste processors since
>> only one mpirun job will run on any node, even if the job is only
>> a two processor job.
> As for oversubscription, I remember we started having the
> -nooversubscribe option in v1.2, which you can use if you want to
> keep ORTE from oversubscribing, because by default oversubscription
> is allowed.
So it seems the real story for me is that there is no logic that
detects the oversubscription condition and re-schedules the job for
another node in the MPI nodelist in OpenMPI 1.1.1? If so, that would
certainly explain what I am seeing. Is that correct?
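
One way to check this from inside a job would be to compare what SGE granted against where the processes actually landed. The sketch below is a rough Python illustration, assuming the usual $PE_HOSTFILE layout of one line per host with the host name in the first column and the slot count in the second. The function names, the sample host file contents, and the "placed" counts are all made up for the example.

```python
def parse_pe_hostfile(text):
    """Return {host: granted_slots} from $PE_HOSTFILE-style text."""
    granted = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        granted[host] = granted.get(host, 0) + slots
    return granted


def oversubscribed(granted, placed):
    """Return {host: (placed, granted)} for hosts running more
    processes than the scheduler granted slots for."""
    return {
        host: (count, granted.get(host, 0))
        for host, count in placed.items()
        if count > granted.get(host, 0)
    }


# Made-up sample: SGE granted 2 slots on each of two hosts...
sample = (
    "node00 2 all.q@node00 UNDEFINED\n"
    "node01 2 all.q@node01 UNDEFINED\n"
)
granted = parse_pe_hostfile(sample)

# ...but all 4 processes ended up on node00.
placed = {"node00": 4}
print(oversubscribed(granted, placed))
```

In a real job the text would come from the file named by the PE_HOSTFILE environment variable, and the placed counts from something like ps output on each node. If the granted list spans several nodes but everything lands on node00, that would suggest mpirun is not reading the SGE allocation at all.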