Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] known limitation or bug in hwloc?
From: nadia.derbey_at_[hidden]
Date: 2011-08-29 12:08:57


devel-bounces_at_[hidden] wrote on 08/29/2011 05:57:59 PM:

> From: Ralph Castain <rhc_at_[hidden]>
> To: Open MPI Developers <devel_at_[hidden]>
> Date: 08/29/2011 05:58 PM
> Subject: Re: [OMPI devel] known limitation or bug in hwloc?
> Sent by: devel-bounces_at_[hidden]
>
> On Aug 29, 2011, at 8:35 AM, nadia.derbey_at_[hidden] wrote:
>
>
> devel-bounces_at_[hidden] wrote on 08/29/2011 04:20:30 PM:
>
> > From: Ralph Castain <rhc_at_[hidden]>
> > To: Open MPI Developers <devel_at_[hidden]>
> > Date: 08/29/2011 04:26 PM
> > Subject: Re: [OMPI devel] known limitation or bug in hwloc?
> > Sent by: devel-bounces_at_[hidden]
> >
> > Actually, I'll eat those words. I was looking at the wrong place.
> >
> > Yes, that is a bug in hwloc. It needs to loop over CPU_MAX for those
> > cases where the bit mask extends over multiple words.
>
> But I'm afraid the fix won't be trivial at all: hwloc in itself is
> coherent: it loops over NUM_BITS, but it uses masks that are
> NUM_BITS wide (hwloc_bitmap_t set)...
>
> I guess I'm missing that - I just did a search and cannot find any
> reference to OPAL_PAFFINITY_BITMASK_T_NUM_BITS anywhere in
> paffinity/hwloc after the last change.
>
> Can you point me to where you believe a problem exists? Or feel free
> to submit a patch to fix it :-) We can push it upstream to the
> hwloc folks for their consideration.

file: opal/mca/paffinity/hwloc/paffinity_hwloc_module.c
routine: module_set()

You have a reference to OPAL_PAFFINITY_BITMASK_T_NUM_BITS both in the trunk
and in v1.5.

But maybe this issue has been fixed already?

Regards,
Nadia

>
>
> Regards,
> Nadia
> >
> >
> > On Aug 29, 2011, at 7:16 AM, Ralph Castain wrote:
> >
> > > Actually, if you look closely at the definition of those two
> > > values, you'll see that it really doesn't matter which one we loop
> > > over. The NUM_BITS value defines the actual total number of bits in
> > > the mask. The CPU_MAX is the total number of cpus we can support,
> > > which was set to a value such that the two are equal (i.e., it's a
> > > power of two that happens to be an integer multiple of 64).
> > >
> > > I believe the original intent was to allow CPU_MAX to be
> > > independent of address-alignment questions, so NUM_BITS could
> > > technically be greater than CPU_MAX. Even if this happens, though,
> > > all that would do is cause the loop to run across more bits than
> > > required.
> > >
> > > So it doesn't introduce a limitation at all. In hindsight, we
> > > could simplify things by eliminating one of those values and just
> > > putting a requirement on the number that it be a multiple of 64 so
> > > it aligns with a memory address.
> > >
> > >
> > > On Aug 29, 2011, at 7:05 AM, Kenneth Lloyd wrote:
> > >
> > >> Nadia,
> > >>
> > >> Interesting. I haven't tried pushing this to levels above 8 on a
> > >> particular machine. Do you think that the cpuset / paffinity / hwloc
> > >> only applies at the machine level, at which time you need to employ
> > >> a graph with carto?
> > >>
> > >> Regards,
> > >>
> > >> Ken
> > >>
> > >> -----Original Message-----
> > >> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On
> > >> Behalf Of nadia.derbey
> > >> Sent: Monday, August 29, 2011 5:45 AM
> > >> To: Open MPI Developers
> > >> Subject: [OMPI devel] known limitation or bug in hwloc?
> > >>
> > >> Hi list,
> > >>
> > >> I'm hitting a limitation with paffinity/hwloc with cpu numbers >= 64.
> > >>
> > >> In opal/mca/paffinity/hwloc/paffinity_hwloc_module.c, module_set() is
> > >> the routine that sets the calling process affinity to the mask given
> > >> as parameter. Note that "mask" is an opal_paffinity_base_cpu_set_t
> > >> (so we allow the cpus to be potentially numbered up to
> > >> OPAL_PAFFINITY_BITMASK_CPU_MAX - 1).
> > >>
> > >> The problem with module_set() is that it loops over
> > >> OPAL_PAFFINITY_BITMASK_T_NUM_BITS bits to check if these bits are set
> > >> in the mask:
> > >>
> > >> for (i = 0; ((unsigned int) i) < OPAL_PAFFINITY_BITMASK_T_NUM_BITS; ++i) {
> > >>     if (OPAL_PAFFINITY_CPU_ISSET(i, mask)) {
> > >>         hwloc_bitmap_set(set, i);
> > >>     }
> > >> }
> > >>
> > >> Given "mask"'s type, I think module_set() should instead loop over
> > >> OPAL_PAFFINITY_BITMASK_CPU_MAX bits.
> > >>
> > >> Note that module_set() uses a type for its internal mask that is
> > >> coherent with OPAL_PAFFINITY_BITMASK_T_NUM_BITS (hwloc_bitmap_t).
> > >>
> > >> So I'm wondering whether this is a known limitation I've never heard
> > >> of or an actual bug?
> > >>
> > >> Regards,
> > >> Nadia
> > >>
> > >>
> > >> _______________________________________________
> > >> devel mailing list
> > >> devel_at_[hidden]
> > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel