
Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Chris Jewell (chris.jewell_at_[hidden])
Date: 2010-11-15 09:29:06


Hi,

> > If, indeed, it is not possible currently to implement this type of core-binding in tightly integrated OpenMPI/GE, then a solution might lie in a custom script run in the parallel environment's 'start proc args'. This script would have to find out which slots are allocated where on the cluster, and write an OpenMPI rankfile.
>
> Exactly this should work.
>
> If you use "binding_instance" "pe" and reformat the information in the $PE_HOSTFILE to a "rankfile", it should work to get the desired allocation. Maybe you can share the script with this list once you've got it working.

As far as I can see, that's not going to work. The problem is that, exactly as with "binding_instance" "set", -binding pe linear:n binds n cores on every node in the job, regardless of how many slots each node was granted. This is easy to verify by submitting a long-running job and examining the pe_hostfile while it runs. For example, I submit a job with:

$ qsub -pe mpi 8 -binding pe linear:1 myScript.com

and my pe_hostfile looks like:

exec6.cluster.stats.local 2 batch.q_at_exec6.cluster.stats.local 0,1
exec1.cluster.stats.local 1 batch.q_at_exec1.cluster.stats.local 0,1
exec7.cluster.stats.local 1 batch.q_at_exec7.cluster.stats.local 0,1
exec5.cluster.stats.local 1 batch.q_at_exec5.cluster.stats.local 0,1
exec4.cluster.stats.local 1 batch.q_at_exec4.cluster.stats.local 0,1
exec3.cluster.stats.local 1 batch.q_at_exec3.cluster.stats.local 0,1
exec2.cluster.stats.local 1 batch.q_at_exec2.cluster.stats.local 0,1

Notice that, because I specified -binding pe linear:1, each execution node binds the job's processes to one core. If I instead use -binding pe linear:2, I get:

exec6.cluster.stats.local 2 batch.q_at_exec6.cluster.stats.local 0,1:0,2
exec1.cluster.stats.local 1 batch.q_at_exec1.cluster.stats.local 0,1:0,2
exec7.cluster.stats.local 1 batch.q_at_exec7.cluster.stats.local 0,1:0,2
exec4.cluster.stats.local 1 batch.q_at_exec4.cluster.stats.local 0,1:0,2
exec3.cluster.stats.local 1 batch.q_at_exec3.cluster.stats.local 0,1:0,2
exec2.cluster.stats.local 1 batch.q_at_exec2.cluster.stats.local 0,1:0,2
exec5.cluster.stats.local 1 batch.q_at_exec5.cluster.stats.local 0,1:0,2

So the pe_hostfile still doesn't accurately represent the binding allocation for OpenMPI to use: the binding column is identical on every node, regardless of how many slots were granted there. Question: is there a system file or command that I could use to check which processors are "occupied"?
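
For reference, the PE_HOSTFILE-to-rankfile conversion suggested above is mechanically simple. Below is a rough sketch in Python; it assumes the usual four-column pe_hostfile layout (hostname, slots, queue, binding, with the binding given as colon-separated socket,core pairs) and Open MPI's "rank N=<host> slot=<socket>:<core>" rankfile syntax, and the output filename is arbitrary:

#!/usr/bin/env python
# Rough sketch only: convert an SGE $PE_HOSTFILE into an Open MPI rankfile.
# Assumes the four-column pe_hostfile layout
#   hostname  slots  queue  binding
# where 'binding' is a colon-separated list of socket,core pairs.
import os
import sys

pe_hostfile = os.environ["PE_HOSTFILE"]           # set by SGE for the job
rankfile = sys.argv[1] if len(sys.argv) > 1 else "rankfile"

rank = 0
entries = []
with open(pe_hostfile) as fh:
    for line in fh:
        fields = line.split()
        if len(fields) < 4:
            continue                               # no binding column present
        host, slots, binding = fields[0], int(fields[1]), fields[3]
        cores = [pair.split(",") for pair in binding.split(":")]
        # One rank per granted slot, cycling over the cores bound on that host.
        for s in range(slots):
            socket, core = cores[s % len(cores)]
            entries.append("rank %d=%s slot=%s:%s" % (rank, host, socket, core))
            rank += 1

with open(rankfile, "w") as out:
    out.write("\n".join(entries) + "\n")

Run from the PE's start_proc_args, that would produce a rankfile mpirun could pick up with -rf, but because the socket,core pairs are identical on every host, a node with more slots than bound cores (exec6 above) ends up with several ranks mapped to the same core, which is exactly the problem.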

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778