Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Chris Jewell (chris.jewell_at_[hidden])
Date: 2010-11-16 12:39:25


On 16 Nov 2010, at 17:25, Terry Dontje wrote:
>>>
>> Sure. Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 -binding linear:2 myScript.com' where myScript.com runs 'mpirun -mca ras_gridengine_verbose 100 --report-bindings ./unterm':
>>
>> [exec4:17384] System has detected external process binding to cores 0022
>> [exec4:17384] ras:gridengine: JOB_ID: 59352
>> [exec4:17384] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
>> [exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=2
>> [exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
>>
>>
>>
> Is that all that came out? I would have expected some output from each process after the orted forked the processes but before the exec of unterm.

Yes. It appears that if orted detects binding done by an external process, this is all you get. Drop the GE-enforced binding (i.e. submit without '-binding linear:2'), and you get:

[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],0] to cpus 0001
[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],1] to cpus 0002
[exec7:06781] [[23443,0],2] odls:default:fork binding child [[23443,1],3] to cpus 0001
[exec2:24160] [[23443,0],1] odls:default:fork binding child [[23443,1],2] to cpus 0001
[exec6:30097] [[23443,0],4] odls:default:fork binding child [[23443,1],5] to cpus 0001
[exec5:02736] [[23443,0],6] odls:default:fork binding child [[23443,1],7] to cpus 0001
[exec1:30779] [[23443,0],5] odls:default:fork binding child [[23443,1],6] to cpus 0001
[exec3:12818] [[23443,0],3] odls:default:fork binding child [[23443,1],4] to cpus 0001
.....
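
For anyone reading these masks by hand, here is a rough Python sketch that decodes them. It assumes the four-digit values printed after "cores" and "cpus" are hexadecimal bitmasks in which bit i stands for logical CPU i; that reading is an assumption and is not confirmed anywhere in this thread. The helper names cpus_in_mask and parse_bindings are made up for illustration and are not part of Open MPI or Grid Engine.

```python
import re


def cpus_in_mask(mask_hex):
    """Return the logical CPU indices whose bits are set in a hex mask string.

    Assumption: bit i of the mask corresponds to logical CPU i.
    """
    value = int(mask_hex, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Matches lines of the form shown in the output above:
# [host:pid] [[job,0],n] odls:default:fork binding child [[job,1],rank] to cpus MASK
BINDING_LINE = re.compile(
    r"^\[(?P<host>[^:\]]+):\d+\].*binding child "
    r"\[\[\d+,\d+\],(?P<rank>\d+)\] to cpus (?P<mask>[0-9a-fA-F]+)"
)


def parse_bindings(log_text):
    """Map each MPI rank to (host, list of CPU indices) from --report-bindings output."""
    bindings = {}
    for line in log_text.splitlines():
        match = BINDING_LINE.match(line.strip())
        if match:
            rank = int(match.group("rank"))
            bindings[rank] = (match.group("host"), cpus_in_mask(match.group("mask")))
    return bindings


if __name__ == "__main__":
    sample = (
        "[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],0] to cpus 0001\n"
        "[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],1] to cpus 0002\n"
    )
    print(cpus_in_mask("0022"))  # [1, 5]
    for rank, (host, cpus) in sorted(parse_bindings(sample).items()):
        print(rank, host, cpus)
```

Under that assumption, the externally set mask 0022 in the first run would correspond to logical CPUs 1 and 5, while 0001 and 0002 in the second run correspond to CPUs 0 and 1 on exec4.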

C

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778