Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Chris Jewell (chris.jewell_at_[hidden])
Date: 2010-11-15 09:29:06


Hi,

> > If, indeed, it is not possible currently to implement this type of core-binding in tightly integrated OpenMPI/GE, then a solution might lie in a custom script run from the parallel environment's start_proc_args. This script would have to find out which slots are allocated where on the cluster, and write an OpenMPI rankfile.
>
> Exactly this should work.
>
> If you use "binding_instance" "pe" and reformat the information in the $PE_HOSTFILE into a rankfile, it should work to get the desired allocation. Maybe you can share the script with this list once you get it working.
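
For reference, the translation being proposed might look something like the sketch below (untested; it assumes the standard four-column $PE_HOSTFILE layout with the bound cores in the fourth column as colon-separated <socket>,<core> pairs, and the "rank N=<host> slot=<socket>:<core>" rankfile syntax; the output file name 'rankfile' is arbitrary):

#!/usr/bin/env python
# Sketch: translate $PE_HOSTFILE into an OpenMPI rankfile.
# Input line:  <host> <slots> <queue> <socket>,<core>[:<socket>,<core>...]
# Output line: rank <N>=<host> slot=<socket>:<core>
import os

rank = 0
with open(os.environ['PE_HOSTFILE']) as hf, open('rankfile', 'w') as rf:
    for line in hf:
        fields = line.split()
        if len(fields) < 4:
            continue
        host, slots = fields[0], int(fields[1])
        pairs = fields[3].split(':')      # e.g. "0,1:0,2" -> ["0,1", "0,2"]
        for i in range(slots):
            if i >= len(pairs):
                break                     # fewer bound cores than slots
            socket, core = pairs[i].split(',')
            rf.write('rank %d=%s slot=%s:%s\n' % (rank, host, socket, core))
            rank += 1

mpirun would then be pointed at the result with --rankfile rankfile.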

As far as I can see, that's not going to work: exactly as with "binding_instance" "set", -binding pe linear:n binds n cores on each node, regardless of how many slots that node was allocated. This is easily verified by submitting a long-running job and examining the $PE_HOSTFILE. For example, I submit a job with:

$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2 batch.q_at_exec6.cluster.stats.local 0,1
exec1.cluster.stats.local 1 batch.q_at_exec1.cluster.stats.local 0,1
exec7.cluster.stats.local 1 batch.q_at_exec7.cluster.stats.local 0,1
exec5.cluster.stats.local 1 batch.q_at_exec5.cluster.stats.local 0,1
exec4.cluster.stats.local 1 batch.q_at_exec4.cluster.stats.local 0,1
exec3.cluster.stats.local 1 batch.q_at_exec3.cluster.stats.local 0,1
exec2.cluster.stats.local 1 batch.q_at_exec2.cluster.stats.local 0,1

Notice that, because I specified -binding pe linear:1, each execution node binds the job's processes to a single core: the fourth column holds just one <socket>,<core> pair (0,1) per host, even though exec6 was allocated two slots. If I specify -binding pe linear:2, I get:

exec6.cluster.stats.local 2 batch.q_at_exec6.cluster.stats.local 0,1:0,2
exec1.cluster.stats.local 1 batch.q_at_exec1.cluster.stats.local 0,1:0,2
exec7.cluster.stats.local 1 batch.q_at_exec7.cluster.stats.local 0,1:0,2
exec4.cluster.stats.local 1 batch.q_at_exec4.cluster.stats.local 0,1:0,2
exec3.cluster.stats.local 1 batch.q_at_exec3.cluster.stats.local 0,1:0,2
exec2.cluster.stats.local 1 batch.q_at_exec2.cluster.stats.local 0,1:0,2
exec5.cluster.stats.local 1 batch.q_at_exec5.cluster.stats.local 0,1:0,2

So the pe_hostfile still doesn't give an accurate representation of the binding allocation for OpenMPI to use: every node reports the same bound cores, regardless of how many slots it actually holds. Question: is there a system file or command that I could use to check which processor cores are "occupied"?
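
For context, the closest thing I know of on Linux is each process's affinity mask, exposed as the Cpus_allowed_list field of /proc/<pid>/status. A rough sketch of scanning it (the full-mask value '0-7' for an 8-core node is an assumption for illustration; processes showing a narrower list have been bound to specific cores):

#!/usr/bin/env python
# Sketch: report processes whose CPU affinity is narrower than the
# full mask, i.e. processes that have been bound to specific cores.
import glob

FULL_MASK = '0-7'    # assumed: all cores of an 8-core node

for path in glob.glob('/proc/[0-9]*/status'):
    try:
        name = cpus = None
        for line in open(path):
            if line.startswith('Name:'):
                name = line.split()[1]
            elif line.startswith('Cpus_allowed_list:'):
                cpus = line.split()[1]
        if cpus is not None and cpus != FULL_MASK:
            pid = path.split('/')[2]
            print('%s (%s) bound to cores %s' % (pid, name, cpus))
    except IOError:
        pass    # process exited while being read

Whether that matches what GE itself considers "occupied" is another question, of course.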

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778