
Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-11-18 05:57:31


Yes, I believe this solves the mystery. In short, OGE and ORTE both
work. In the linear:1 case the job exits because there are not enough
resources inside the external binding for the ORTE binding to work,
which actually makes sense: SGE hands exec1 a single-core mask but
allocates it two slots, so mpirun cannot bind two processes to their
own cores there. In the linear:2 case I think we've proven that we are
binding to the right number of resources, and to the correct physical
resources, at the process level.
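
A note on reading the -report-bindings output below: the "cpus" values
look like hexadecimal bitmasks of logical CPU indices (an assumption on
my part, but it is consistent with the output -- the external mask 0028
is exactly the union of the two child masks 0008 and 0020, i.e. cores 3
and 5). A tiny stand-alone decoder, sketch only:

/* decode_mask.c -- sketch only; assumes the "cpus NNNN" values printed
 * by -report-bindings are hexadecimal bitmasks where bit i means
 * logical CPU i.
 * Build:   cc -o decode_mask decode_mask.c
 * Example: ./decode_mask 0028   ->   mask 0x0028 -> cpus: 3 5
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    unsigned long mask = strtoul(argc > 1 ? argv[1] : "0028", NULL, 16);

    printf("mask 0x%04lx -> cpus:", mask);
    for (unsigned i = 0; i < 8 * sizeof(mask); i++)
        if (mask & (1UL << i))          /* bit i set => CPU i allowed */
            printf(" %u", i);
    printf("\n");
    return 0;
}

Read that way, the linear:2 run on exec1 binds its two local processes
to two distinct cores inside the SGE mask, which is what we wanted to
verify.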

In the case where you do not pass -bind-to-core to mpirun with a qsub
using linear:2, the processes on the same node will actually be bound
to the same two cores (they simply inherit the external SGE mask). The
only way to determine this is to run something that prints out the
binding from the system, as sketched below. There is no way to do this
via OMPI, because it only reports bindings when you ask mpirun to do
some type of binding itself (like -bind-to-core or -bind-to-socket).
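
Something along these lines is what I mean by a program that prints the
binding from the system -- a minimal sketch, assuming a Linux execution
host and mpicc; the file and program names are just examples:

/* print_binding.c -- minimal sketch of "something that prints out the
 * binding from the system".  Assumes Linux (sched_getaffinity) and an
 * MPI C compiler.
 * Build: mpicc -o print_binding print_binding.c
 * Run it exactly like the real job, e.g. under the same qsub/mpirun line.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[64];
    gethostname(host, sizeof(host));

    cpu_set_t set;
    CPU_ZERO(&set);
    sched_getaffinity(0, sizeof(set), &set);      /* 0 = this process */

    /* Collect the CPUs this rank is allowed to run on. */
    char cpus[256];
    int n = 0;
    cpus[0] = '\0';
    for (int i = 0; i < CPU_SETSIZE && n < (int)sizeof(cpus) - 8; i++)
        if (CPU_ISSET(i, &set))
            n += snprintf(cpus + n, sizeof(cpus) - n, " %d", i);

    printf("%s rank %d: allowed cpus:%s\n", host, rank, cpus);

    MPI_Finalize();
    return 0;
}

Launched under the same qsub/mpirun line (without -bind-to-core), both
ranks placed on exec1 should report the same two allowed CPUs if they
really are sharing the external two-core mask.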

In the linear:1 case with no binding, I think the processes on the same
node are running on the same single core, which is exactly what you are
asking for, I believe.

So I believe we understand what is going on with the binding, and it
makes sense to me. As for the allocation issue of slots vs. cores and
trying not to overallocate cores, I believe the new allocation rule
makes sense, but I'll let you hash that out with Daniel.

In summary, I don't believe there are any OMPI bugs related to what
we've seen, and the OGE issue is just the allocation issue, right?

--td

On 11/18/2010 01:32 AM, Chris Jewell wrote:
>>> Perhaps if someone could run this test again with --report-bindings --leave-session-attached and provide -all- output we could verify that analysis and clear up the confusion?
>>>
>> Yeah, however I bet you we still won't see output.
> Actually, it seems we do get more output! Results of 'qsub -pe mpi 8 -binding linear:2 myScript.com'
>
> with
>
> 'mpirun -mca ras_gridengine_verbose 100 -report-bindings --leave-session-attached -bycore -bind-to-core ./unterm'
>
> [exec1:06504] System has detected external process binding to cores 0028
> [exec1:06504] ras:gridengine: JOB_ID: 59467
> [exec1:06504] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec1/active_jobs/59467.1/pe_hostfile
> [exec1:06504] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
> [exec1:06504] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06504] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06504] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06504] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06504] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06504] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],0] to cpus 0008
> [exec1:06504] [[59608,0],0] odls:default:fork binding child [[59608,1],1] to cpus 0020
> [exec3:20248] [[59608,0],1] odls:default:fork binding child [[59608,1],2] to cpus 0008
> [exec4:26792] [[59608,0],4] odls:default:fork binding child [[59608,1],5] to cpus 0001
> [exec2:32462] [[59608,0],2] odls:default:fork binding child [[59608,1],3] to cpus 0001
> [exec7:09833] [[59608,0],3] odls:default:fork binding child [[59608,1],4] to cpus 0002
> [exec5:10834] [[59608,0],5] odls:default:fork binding child [[59608,1],6] to cpus 0001
> [exec6:04230] [[59608,0],6] odls:default:fork binding child [[59608,1],7] to cpus 0001
>
> AHHA! Now I get the following if I use 'qsub -pe mpi 8 -binding linear:1 myScript.com' with the above mpirun command:
>
> [exec1:06552] System has detected external process binding to cores 0020
> [exec1:06552] ras:gridengine: JOB_ID: 59468
> [exec1:06552] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec1/active_jobs/59468.1/pe_hostfile
> [exec1:06552] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
> [exec1:06552] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06552] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06552] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06552] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06552] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=1
> [exec1:06552] ras:gridengine: exec6.cluster.stats.local: PE_HOSTFILE shows slots=1
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an error:
>
> Error name: Unknown error: 1
> Node: exec1
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> [exec1:06552] [[59432,0],0] odls:default:fork binding child [[59432,1],0] to cpus 0020
> --------------------------------------------------------------------------
> Not enough processors were found on the local host to meet the requested
> binding action:
>
> Local host: exec1
> Action requested: bind-to-core
> Application name: ./unterm
>
> Please revise the request and try again.
> --------------------------------------------------------------------------
> [exec4:26816] [[59432,0],4] odls:default:fork binding child [[59432,1],5] to cpus 0001
> [exec3:20345] [[59432,0],1] odls:default:fork binding child [[59432,1],2] to cpus 0020
> [exec2:32486] [[59432,0],2] odls:default:fork binding child [[59432,1],3] to cpus 0001
> [exec7:09921] [[59432,0],3] odls:default:fork binding child [[59432,1],4] to cpus 0002
> [exec6:04257] [[59432,0],6] odls:default:fork binding child [[59432,1],7] to cpus 0001
> [exec5:10861] [[59432,0],5] odls:default:fork binding child [[59432,1],6] to cpus 0001
>
>
>
> Hope that helps clear up the confusion! Please say it does, my head hurts...
>
> Chris
>
>
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
>
>
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


