I confess I'm confused. OMPI allows you to "oversubscribe" a node without any modification of job allocations. Just ask it to launch however many processes you want - it will ignore the allocated number of slots and do it.
It will set sched_yield appropriately to deal with the oversubscription: each individual process will run a little slower than it otherwise would have, but will play nicer when it comes to "sharing" the available CPUs. Likewise, it won't bind your processes to specific cores, which is what you want in this scenario.
So all you have to do is change your mpirun line to have -np N, where N is the actual number of desired processes. Or, if you prefer, you can use
to tell mpirun to launch M processes on each node. Either will work.
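As a rough sketch of the first option, the submission script quoted below could keep its 8-slot request and simply ask mpirun for more processes than slots. The factor of 2 and the OVERSUB and NP variable names are illustrative assumptions, not values from this thread; the directives, paths, and executable name are copied from the original script. The per-node option named in the message above did not survive in this archive, so the commented-out line uses -npernode purely as an assumption.

```shell
#!/bin/bash
#$ -N first
#$ -S /bin/bash
#$ -cwd
#$ -e $JOB_ID.$JOB_NAME.ERROR
#$ -o $JOB_ID.$JOB_NAME.OUTPUT
#$ -P faculty_prj
#$ -p 0
#$ -pe orte 8

# Assumption: run 2 processes per allocated slot (illustrative factor only).
OVERSUB=2

# NSLOTS is set by Grid Engine to the number of slots granted (8 here),
# so NP becomes 16 with the factor above.
NP=$(( NSLOTS * OVERSUB ))

# Ask mpirun for more processes than slots; the count is given explicitly.
/opt/mpi/openmpi/1.3.3/gnu/bin/mpirun -np "$NP" ./test_vel.out

# Per-node alternative. Assumption: -npernode is the relevant option in this
# Open MPI version, and 16 is an arbitrary count; check "mpirun --help" first.
# /opt/mpi/openmpi/1.3.3/gnu/bin/mpirun -npernode 16 ./test_vel.out
```

Only the mpirun line changes relative to the original script; the parallel environment request stays at 8 slots.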
On Dec 26, 2011, at 10:32 AM, Santosh Ansumali wrote:
> Thanks for the response. Maybe I am wrong. However, my argument is as
> follows: our tests show that a 100^3 grid per core performs 10 times
> faster (normalised in proper units) than a 200^3 grid. Neither of these
> sizes fits in cache. The 100^3 run benefits from its smaller size,
> where the compiler guesses the access pattern slightly better.
> So, instead of running one large job of 200^3 per core, if I
> oversubscribe the core with smaller jobs of size comparable to 100^3,
> the large saving from better memory access should compensate for thread
> On Mon, Dec 26, 2011 at 10:31 PM, Matthieu Brucher
> <matthieu.brucher_at_[hidden]> wrote:
>> If your problem is memory bound and if you don't use the whole memory
>> capacity of one node, it means that you are limited by your memory
>> bandwidth. In this case oversubscribing the number of processes will lead to
>> worse behavior, as all processes will fight for the same memory bandwidth.
>> Just my opinion.
>> Matthieu Brucher
>> 2011/12/23 Santosh Ansumali <ansumali_at_[hidden]>
>>> Dear All,
>>> We are running a PDE solver which is memory bound. Due to
>>> cache-related issues, a smaller number of grid points per core leads to
>>> better performance for this code. Thus, though the available memory per
>>> core is more than 2 GB, we are able to get good performance by using
>>> less than 1 GB per core.
>>> I want to know whether oversubscribing the cores can potentially
>>> improve performance of such a code. My thinking is that if I
>>> oversubscribe the cores, each thread will be using less than 1 GB so
>>> cache-related problems will be less severe. Is this logic correct, or
>>> will performance deteriorate further due to cache conflicts?
>>> In case over-subscription can help, how shall I modify the
>>> submission file (using Sun Grid Engine) to enable over-subscription of
>>> My current submission file is written as follows:
>>> #$ -N first
>>> #$ -S /bin/bash
>>> #$ -cwd
>>> #$ -e $JOB_ID.$JOB_NAME.ERROR
>>> #$ -o $JOB_ID.$JOB_NAME.OUTPUT
>>> #$ -P faculty_prj
>>> #$ -p 0
>>> #$ -pe orte 8
>>> /opt/mpi/openmpi/1.3.3/gnu/bin/mpirun -np $NSLOTS ./test_vel.out
>>> Is it possible to allow over-subscription by modifying submission file
>>> itself? Or do I need to change hostfiles somehow?
>>> Thanks for your help!
>>> Best Regards
>>> Santosh Ansumali,
>>> Faculty Fellow,
>>> Engineering Mechanics Unit
>>> Jawaharlal Nehru Centre for Advanced Scientific Research (JNCASR)
>>> Jakkur, Bangalore-560 064, India
>>> Tel: + 91 80 22082938
>>> users mailing list
>> Information System Engineer, Ph.D.
>> Blog: http://matt.eifelle.com
>> LinkedIn: http://www.linkedin.com/in/matthieubrucher