Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Chris Jewell (chris.jewell_at_[hidden])
Date: 2010-11-18 01:32:37


>
>> Perhaps if someone could run this test again with --report-bindings --leave-session-attached and provide -all- output we could verify that analysis and clear up the confusion?
>>
> Yeah, however I bet you we still won't see output.

Actually, it seems we do get more output! Here are the results of 'qsub -pe mpi 8 -binding linear:2 myScript.com', using this mpirun command:

'mpirun -mca ras_gridengine_verbose 100 -report-bindings --leave-session-attached -bycore -bind-to-core ./unterm'

[exec1:06504] System has detected external process binding to cores 0028
[exec1:06504] ras:gridengine: JOB_ID: 59467
[exec1:06504] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec1/active_jobs/59467.1/pe_hostfile
[exec1:06504] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec1:06504] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],0] to cpus 0008
[exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],1] to cpus 0020
[exec3:20248] [[59608,0],1] odls:default:fork binding child [[59608,1],2] to cpus 0008
[exec4:26792] [[59608,0],4] odls:default:fork binding child [[59608,1],5] to cpus 0001
[exec2:32462] [[59608,0],2] odls:default:fork binding child [[59608,1],3] to cpus 0001
[exec7:09833] [[59608,0],3] odls:default:fork binding child [[59608,1],4] to cpus 0002
[exec5:10834] [[59608,0],5] odls:default:fork binding child [[59608,1],6] to cpus 0001
[exec6:04230] [[59608,0],6] odls:default:fork binding child [[59608,1],7] to cpus 0001

Aha! With 'qsub -pe mpi 8 -binding linear:1 myScript.com' and the same mpirun command as above, I now get the following:

[exec1:06552] System has detected external process binding to cores 0020
[exec1:06552] ras:gridengine: JOB_ID: 59468
[exec1:06552] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec1/active_jobs/59468.1/pe_hostfile
[exec1:06552] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec1:06552] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec1:06552] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error:

Error name: Unknown error: 1
Node: exec1

when attempting to start process rank 0.
--------------------------------------------------------------------------
[exec1:06552] [[59432,0],0] odls:default:fork binding child [[59432,1],0] to cpus 0020
--------------------------------------------------------------------------
Not enough processors were found on the local host to meet the requested
binding action:

  Local host: exec1
  Action requested: bind-to-core
  Application name: ./unterm

Please revise the request and try again.
--------------------------------------------------------------------------
[exec4:26816] [[59432,0],4] odls:default:fork binding child [[59432,1],5] to cpus 0001
[exec3:20345] [[59432,0],1] odls:default:fork binding child [[59432,1],2] to cpus 0020
[exec2:32486] [[59432,0],2] odls:default:fork binding child [[59432,1],3] to cpus 0001
[exec7:09921] [[59432,0],3] odls:default:fork binding child [[59432,1],4] to cpus 0002
[exec6:04257] [[59432,0],6] odls:default:fork binding child [[59432,1],7] to cpus 0001
[exec5:10861] [[59432,0],5] odls:default:fork binding child [[59432,1],6] to cpus 0001

Hope that helps clear up the confusion! Please say it does, my head hurts...

Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778