Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Chris Jewell (chris.jewell_at_[hidden])
Date: 2010-11-15 11:06:17

Hi Ralph,

Thanks for the tip. With the command

$ qsub -pe mpi 8 -binding linear:1

I get the output

[exec6:29172] System has detected external process binding to cores 0008
[exec6:29172] ras:gridengine: JOB_ID: 59282
[exec6:29172] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec6/active_jobs/59282.1/pe_hostfile
[exec6:29172] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec6:29172] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec6:29172] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec6:29172] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec6:29172] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec6:29172] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec6:29172] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=1

Presumably that means OMPI is detecting the external binding okay. If so, that confirms my problem is an issue with how GE sets the processor affinity: the controlling sge_shepherd process on each physical exec node is bound to the requested number of cores (in this case 1), so any child process (i.e. the OMPI parallel processes) inherits the same binding and ends up confined to the same core. What we really need is for GE to set the binding on each execution node according to the number of parallel processes that will run there. Not sure this is doable currently...
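For what it's worth, the inheritance behaviour itself is easy to demonstrate outside SGE: on Linux, a forked child starts with its parent's CPU affinity mask, which is exactly why processes launched under a bound sge_shepherd end up pinned to the same core unless something resets the mask. A minimal sketch (Linux-only, since `os.sched_getaffinity` is not available on other platforms):

```python
import os

# On Linux, a child process inherits its parent's CPU affinity mask --
# the same mechanism by which MPI processes launched under a bound
# sge_shepherd end up confined to the shepherd's core(s).
parent_mask = os.sched_getaffinity(0)

pid = os.fork()
if pid == 0:
    # Child: exit 0 if we inherited exactly the parent's mask.
    os._exit(0 if os.sched_getaffinity(0) == parent_mask else 1)
else:
    _, status = os.waitpid(pid, 0)
    print("child inherited parent affinity:", os.WEXITSTATUS(status) == 0)
```

Running this under `qsub -binding linear:1` should show the child stuck with the single-core mask that GE gave the shepherd.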



Dr Chris Jewell
Department of Statistics
University of Warwick
Tel: +44 (0)24 7615 0778