Yes I still have that in mind, but I am not a big fan of adding yet two other distribute functions. It's a pitty that we need 4 of them to get everything done. If we multiply by 2 every 2 years, it won't scale :)
The current interface doesn't well support distributing N objects among M child when M > N, that's causing your request (fixing this + singlify_last would be enough), and that may cause other requests in the future. But that's hard to fix properly.
I would rather deprecate the existing functions, and make a new single one (needs a new name) that won't need yet another fix in two years. At least, it'll take a flag argument (reverse would be one of this flag).

singlify_reverse() (or _last() or whatever good name) is ok (just needs some additional tests for make check).

Brice




Le 18/09/2013 14:42, Jiri Hladky a écrit :
Hi Brice,
hi Samuel,

hopefully you still remember discussion we had regarding a proposed new option 
--reverse_direction
or perhaps better
--start_from_last_object

which would force hwloc-distrib to start assigning PUs from the last object (rather than from PU#0 as the default)

I have now implemented the change on the latest trunk. 


I had to modify only three functions
  • hwloc_distributev in helper.h
  • hwloc_distribute in helper.h
  • hwloc_bitmap_singlify in bitmap.c
The change is trivial. In helper.h I had to only change the direction of loops, ie from

for(i=0;i<N;++i) {
  anything[i] = 
to
for(i=N;i>0;--i)
  anything[i-1] = 

(Please be careful here as "i" is unsigned int and it will wrap around when doing i-- for i == 0 . Alternative code would be
for (size_t i = n-1; i < n ; --i) { ... }
which has the benefit that you don't need to change i to i-1 in the loop body but it's for statement is less readable)
In hwloc_bitmap_singlify I had to
  • again change to loop direction
  • replace 
int _ffs = hwloc_ffsl(w);
set->ulongs[i] = HWLOC_SUBBITMAP_CPU(_ffs-1);
with
int _fls = hwloc_flsl(w);
set->ulongs[i-1] = HWLOC_SUBBITMAP_CPU(_fls-1);

The patch file is attached. 

Could you please review it if you agree to add it to the 1.8?

Other changes would require:
  • add the new command line option to hwloc-distrib
  • clone functions hwloc_distributev, hwloc_distribute, hwloc_bitmap_singlify to hwloc_distributev_reverse, hwloc_distribute_reverse, hwloc_bitmap_singlify_reverse and when a command line option is specified, call these new functions
I have tested the new code on 4 Socket system with 15 cores per socket and 30 PUs per socket and it works as expected:

utils/hwloc-distrib --single --taskset 64
0x800000000000000000000000000000
0x400000000000000000000000000000
0x100000000000000000000000000
0x20000000000000000000000
0x4000000000000000000

It will assign the jobs in this order
  1. last PU on the last core on Socket #4
  2. last PU on the last core Socket #3
  3. last PU on the last core Socket #2
  4. last PU on the last core Socket #1
  5. last PU on the second last core on Socket #4
Please let me know what you think about it. I would really appreciate if you can include it into the official code.

Thanks a lot!
Jirka




On Thu, Aug 29, 2013 at 9:58 AM, Brice Goglin <Brice.Goglin@inria.fr> wrote:
Le 28/08/2013 16:20, Jiri Hladky a écrit :
I got your point. On the other hand I think that hwloc-distrib is at the moment not flexible enough to handle such case. I believe that the current strategy - start from the first object - is not the best one. From my experience, core 0 is always most used by the system so it seems that better strategy would to allocate the cores from the last one.

Most people expect their jobs to be launched in order: process0 on first cores, process1 on next cores, etc.

Again, you don't want to reverse the output, you want to ignore first core/socket if possible.


I was looking at the source code of the hwloc-distrib and I believe that only this part of the code would be affected:

      for (i = 0; i < chunks; i++)
        roots[i] = hwloc_get_obj_by_depth(topology, from_depth, i);  => change this to roots[i] = hwloc_get_obj_by_depth(topology, from_depth, MAX_COUNT - i);

      hwloc_distributev(topology, roots, chunks, cpuset, n, to_depth); => rewrite this to iterate in the reverse direction

You can do the exact same thing by reversing your loop in the caller.

Anyway, reversing the loop just move the core you don't want to the end of the list. But if you use the entire list, you end up using the exact same cores. You just reordered the processes among these cores.


Am I missing something? In case of infinite bitmap hwloc-distrib will error out. This should solve the problems with hwloc_bitmap_singlify.

We need a new singlify() above, one that doesn't use the first bit. That's what will make you use a core that is not the first one in each socket.

Problems with infinite bitmaps are unrelated with hwloc-distrib, but they need to be handled in that new bitmap API anyway.

Brice


_______________________________________________
hwloc-devel mailing list
hwloc-devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel