
Hardware Locality Users' Mailing List Archives


Subject: Re: [hwloc-users] distributing across cores with hwloc-distrib
From: Tim Creech (tcreech_at_[hidden])
Date: 2014-03-30 12:00:03


Thanks! This is very helpful. With the patch in place I see very
reasonable output.
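
For reference, here is the minimal test I used, in case it helps anyone
reading the archives. It's only a sketch against the hwloc 1.9 API with the
patch applied (error checking omitted), using the same synthetic "4 4"
topology as in the example below:

#include <hwloc.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_obj_t root;
  hwloc_cpuset_t sets[2];
  unsigned i;

  /* Same synthetic topology as "hwloc-distrib -i '4 4'":
   * 4 cores with 4 PUs each, 16 PUs total. */
  hwloc_topology_init(&topology);
  hwloc_topology_set_synthetic(topology, "4 4");
  hwloc_topology_load(topology);

  /* Distribute 2 processes below the root, down to any depth. */
  root = hwloc_get_root_obj(topology);
  hwloc_distrib(topology, &root, 1, sets, 2, INT_MAX, 0);

  for (i = 0; i < 2; i++) {
    char *str;
    hwloc_bitmap_asprintf(&str, sets[i]);
    printf("%s\n", str); /* with the patch: 0x000000ff, then 0x0000ff00 */
    free(str);
    hwloc_bitmap_free(sets[i]);
  }

  hwloc_topology_destroy(topology);
  return 0;
}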

Might this patch (eventually) make it into a hwloc release?

-Tim

On Sun, Mar 30, 2014 at 05:32:38PM +0200, Brice Goglin wrote:
> Don't worry, binding multithreaded processes is not a corner case. I was
> rather talking about the general case of "distributing fewer processes than
> there are objects and returning cpusets as large as possible".
>
> The attached patch should help. Please let me know.
>
> Brice
>
>
> > On 30/03/2014 17:08, Tim Creech wrote:
> > Hi Brice,
> > First, my apologies if this email starts a new thread. For some reason I
> > never received your response through Mailman and can only see it through the
> > web archive interface, so I'm constructing this response without headers like
> > "In-Reply-To".
> >
> > Thank you for your very helpful response. I'll use your explanation of the
> > algorithm to try to understand the implementation. I was indeed expecting
> > hwloc-distrib to help me bind multithreaded processes, although I can
> > certainly understand that this is considered a corner case. Could you
> > please consider fixing this?
> >
> > Thanks,
> > Tim
> >
> > Brice Goglin wrote:
> >> Hello,
> >>
> >> This is the main corner case of hwloc-distrib. It can return individual
> >> objects only, not groups of objects. The distrib algorithm is:
> >> 1) start at the root, where there are M children and you have to
> >> distribute N processes
> >> 2) if there are no children, or if N is 1, return the entire object
> >> 3) split N into M pieces Ni (N = sum of the Ni) based on each child's
> >> weight (the number of PUs under it)
> >> If N>=M, all Ni can be > 0, so every child gets some process;
> >> if N<M, you can't split N into M nonzero integer pieces, so some Ni
> >> will be 0 and those objects won't get any process
> >> 4) go back to (2) and recurse into each child with Ni instead of N
> >>
> >> Your case is step 3 with N=2 and M=4. It basically means that we
> >> distribute across cores without "assembling groups of cores when needed".
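> >>
> >> Concretely, with the current rounding (chunk = ceil(N*weight/tot_weight),
> >> where N and tot_weight are decremented as we go): the first core gets
> >> ceil(2*4/16) = 1 process, the second gets ceil(1*4/12) = 1, and N is then
> >> exhausted, so the last two cores get nothing. That's exactly the
> >> 0x0000000f / 0x000000f0 output you saw.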
> >>
> >> In your case, when you bind a single-threaded task to 2 cores of 4 PUs
> >> each, the task only uses one PU in the end; 1 core and 3 PUs are ignored
> >> anyway. They *may* be used, but the operating system scheduler is free to
> >> ignore them. So binding to 2 cores, binding to 1 core, and binding to 1 PU
> >> are almost equivalent. At least the latter is included in the former. And
> >> most people pass --single to get a single PU anyway.
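> >>
> >> For example (the output here is what I'd expect on your synthetic
> >> topology, since --single keeps only the first PU of each computed cpuset):
> >>
> >> $ hwloc-distrib -i "4 4" --single 2
> >> 0x00000001
> >> 0x00000010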
> >>
> >> The case where it's not equivalent is when you bind multithreaded
> >> processes. If you have 8 threads, it's better to use 2 cores than a
> >> single one. If this case matters to you, I will look into fixing this
> >> corner case.
> >>
> >> Brice
> >>
> >> On 30/03/2014 07:56, Tim Creech wrote:
> >>> Hello,
> >>> I would like to use hwloc_distrib for a project, but I'm having some
> >>> trouble understanding how it distributes. Specifically, it seems to
> >>> avoid distributing multiple processes across cores, and I'm not sure
> >>> why.
> >>>
> >>> As an example, consider the actual output of:
> >>>
> >>> $ hwloc-distrib -i "4 4" 2
> >>> 0x0000000f
> >>> 0x000000f0
> >>>
> >>> I'm expecting hwloc-distrib to tell me how to distribute 2 processes
> >>> across the 16 PUs (4 cores by 4 PUs), but the answer only involves 8
> >>> PUs, leaving the other 8 unused. If there were more cores on the
> >>> machine, then potentially the vast majority of them would be unused.
> >>>
> >>> In other words, I might expect the output to use all of the PUs across
> >>> cores, for example:
> >>>
> >>> $ hwloc-distrib -i "4 4" 2
> >>> 0x000000ff
> >>> 0x0000ff00
> >>>
> >>> Why does hwloc-distrib leave PUs unused? I'm using hwloc-1.9. Any help
> >>> in understanding where I'm going wrong is greatly appreciated!
> >>>
> >>> Thanks,
> >>> Tim
> >>>
>

> diff --git a/include/hwloc/helper.h b/include/hwloc/helper.h
> index 750f404..62fbba4 100644
> --- a/include/hwloc/helper.h
> +++ b/include/hwloc/helper.h
> @@ -685,6 +685,7 @@ hwloc_distrib(hwloc_topology_t topology,
>  {
>    unsigned i;
>    unsigned tot_weight;
> +  unsigned given, givenweight;
>    hwloc_cpuset_t *cpusetp = set;
>
>    if (flags & ~HWLOC_DISTRIB_FLAG_REVERSE) {
> @@ -697,23 +698,40 @@ hwloc_distrib(hwloc_topology_t topology,
>      if (roots[i]->cpuset)
>        tot_weight += hwloc_bitmap_weight(roots[i]->cpuset);
>
> -  for (i = 0; i < n_roots && tot_weight; i++) {
> -    /* Give to roots[] a portion proportional to its weight */
> +  for (i = 0, given = 0, givenweight = 0; i < n_roots; i++) {
> +    unsigned chunk, weight;
>      hwloc_obj_t root = roots[flags & HWLOC_DISTRIB_FLAG_REVERSE ? n_roots-1-i : i];
> -    unsigned weight = root->cpuset ? hwloc_bitmap_weight(root->cpuset) : 0;
> -    unsigned chunk = (n * weight + tot_weight-1) / tot_weight;
> -    if (!root->arity || chunk == 1 || root->depth >= until) {
> +    hwloc_cpuset_t cpuset = root->cpuset;
> +    if (!cpuset)
> +      continue;
> +    weight = hwloc_bitmap_weight(cpuset);
> +    if (!weight)
> +      continue;
> +    /* Give to roots[] a chunk proportional to its weight.
> +     * If previous chunks got rounded-up, we'll get a bit less. */
> +    chunk = (( (givenweight+weight) * n + tot_weight-1) / tot_weight)
> +          - (( givenweight * n + tot_weight-1) / tot_weight);
> +    if (!root->arity || chunk <= 1 || root->depth >= until) {
>        /* Got to the bottom, we can't split any more, put everything there. */
> -      unsigned j;
> -      for (j=0; j<n; j++)
> -        cpusetp[j] = hwloc_bitmap_dup(root->cpuset);
> +      if (chunk) {
> +        /* Fill cpusets with ours */
> +        unsigned j;
> +        for (j=0; j < chunk; j++)
> +          cpusetp[j] = hwloc_bitmap_dup(cpuset);
> +      } else {
> +        /* We got no chunk, just add our cpuset to a previous one
> +         * so that we don't get ignored.
> +         * (the first chunk cannot be empty). */
> +        assert(given);
> +        hwloc_bitmap_or(cpusetp[-1], cpusetp[-1], cpuset);
> +      }
>      } else {
>        /* Still more to distribute, recurse into children */
>        hwloc_distrib(topology, root->children, root->arity, cpusetp, chunk, until, flags);
>      }
>      cpusetp += chunk;
> -    tot_weight -= weight;
> -    n -= chunk;
> +    given += chunk;
> +    givenweight += weight;
>    }
>
>    return 0;
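
(For anyone else reading: if I understand the patch correctly, each root now
gets chunk_i = ceil(given_weight_after*N/W) - ceil(given_weight_before*N/W),
where W is the total PU weight, so the chunks telescope and always sum to
exactly N, and a root whose chunk rounds to 0 is OR-ed into the previous
cpuset instead of being dropped. That's how the "4 4" example above ends up
covering all 16 PUs with 0x000000ff and 0x0000ff00.)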