Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Chris Jewell (chris.jewell_at_[hidden])
Date: 2010-11-16 12:39:25


On 16 Nov 2010, at 17:25, Terry Dontje wrote:
>>>
>> Sure. Here's the stderr of a job submitted to my cluster with 'qsub -pe mpi 8 -binding linear:2 myScript.com' where myScript.com runs 'mpirun -mca ras_gridengine_verbose 100 --report-bindings ./unterm':
>>
>> [exec4:17384] System has detected external process binding to cores 0022
>> [exec4:17384] ras:gridengine: JOB_ID: 59352
>> [exec4:17384] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec4/active_jobs/59352.1/pe_hostfile
>> [exec4:17384] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=2
>> [exec4:17384] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=1
>> [exec4:17384] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
>>
>>
>>
> Is that all that came out? I would have expected some output from each process after the orted forked the processes but before the exec of unterm.

Yes. It appears that if orted detects binding already applied by an external process, then this is all the output you get. Remove the GE-enforced binding, and you get:

[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],0] to cpus 0001
[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],1] to cpus 0002
[exec7:06781] [[23443,0],2] odls:default:fork binding child [[23443,1],3] to cpus 0001
[exec2:24160] [[23443,0],1] odls:default:fork binding child [[23443,1],2] to cpus 0001
[exec6:30097] [[23443,0],4] odls:default:fork binding child [[23443,1],5] to cpus 0001
[exec5:02736] [[23443,0],6] odls:default:fork binding child [[23443,1],7] to cpus 0001
[exec1:30779] [[23443,0],5] odls:default:fork binding child [[23443,1],6] to cpus 0001
[exec3:12818] [[23443,0],3] odls:default:fork binding child [[23443,1],4] to cpus 0001
.....
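
For anyone wanting to cross-check the two logs above, here is a rough Python sketch that tallies the slots reported by ras:gridengine and the children that odls reports binding on each host. The helper names (slots_per_host, bindings_per_host, mask_to_cores) are made up for illustration. Treating the "cpus"/"cores" value as a hexadecimal bitmask with bit i meaning core i is an assumption on my part, not something the output confirms.

```python
import re
from collections import Counter

# Matches e.g. "ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=2"
SLOT_RE = re.compile(r"ras:gridengine: (\S+): PE_HOSTFILE shows slots=(\d+)")
# Matches e.g. "[exec4:17670] [[23443,0],0] odls:default:fork binding child [[23443,1],0] to cpus 0001"
BIND_RE = re.compile(
    r"\[(\w+):\d+\] .*odls:default:fork binding child \S+ to cpus ([0-9a-fA-F]+)"
)


def slots_per_host(log):
    """Sum the slots that ras:gridengine reports per short host name."""
    slots = Counter()
    for host, count in SLOT_RE.findall(log):
        slots[host.split(".")[0]] += int(count)
    return slots


def bindings_per_host(log):
    """Collect the masks of children bound on each host, in log order."""
    bound = {}
    for host, mask in BIND_RE.findall(log):
        bound.setdefault(host, []).append(mask)
    return bound


def mask_to_cores(mask):
    """Decode a mask into core indices.

    ASSUMPTION: the mask is hexadecimal and bit i stands for core i.
    """
    value = int(mask, 16)
    return [i for i in range(value.bit_length()) if (value >> i) & 1]
```

Feeding the first log to slots_per_host() and the second to bindings_per_host() lets you compare, per host, the number of slots granted with the number of children actually bound. Under the hex assumption, mask_to_cores("0022") gives cores 1 and 5.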

C

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778