On Mon, 2011-02-28 at 22:17 +0100, Brice Goglin wrote:
> On 28/02/2011 22:04, Jeff Squyres wrote:
> > That being said, someone cited on this list a long time ago that running the hwloc detection on very large machines (e.g., SGI machines with 1000+ cores) takes on the order of seconds (because it traverses /sys, etc.). So if you want your tool to be used on machines like that, then it might be better to do the discovery once and share that data among your threads.
> People running on such large machines should really export the machine
> topology to XML once and reload from there all the time.
Btw. lstopo on such a large machine (64 NUMA nodes, 1024 logical CPUs)
takes about 0.6 seconds at our site.
This is acceptable for scripts that run only infrequently. It is also
acceptable for executables that need the topology info at start time
(e.g. pbs_mom of Torque).
For calculating topology-based pinning schemes and pinning processes (as
done e.g. by OpenMPI or MVAPICH2), this is too long when every process
(MPI task) or thread loads the topology in parallel. But exporting the
topology to XML and reusing it for this purpose is unacceptable when
Linux cpusets are used, because one needs the topology of a subset of
the machine, depending on the caller's context. What we currently do is
let only one process per machine load the topology and distribute the
essentials needed for pinning to the remaining processes.
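
To illustrate the "one loader per machine" idea, here is a minimal
sketch in Python. It is not our actual code and it does not use hwloc:
the function names discover_allowed_cpus and plan_pinning are
hypothetical, the "essentials" are reduced to the list of logical CPUs
the caller may use, and the block-wise split is just one possible
pinning scheme that ignores NUMA locality.

```python
import os


def discover_allowed_cpus():
    """Run by ONE process per node: list the logical CPUs this context may use.

    os.sched_getaffinity is Linux-only. It reports the caller's affinity
    mask, which the kernel keeps within the enclosing cpuset.
    """
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    # Fallback for other platforms: assume every CPU is usable.
    return list(range(os.cpu_count() or 1))


def plan_pinning(allowed_cpus, n_ranks):
    """Split the allowed CPUs into n_ranks contiguous blocks (illustrative scheme)."""
    cpus = sorted(allowed_cpus)
    if n_ranks < 1 or n_ranks > len(cpus):
        raise ValueError("need between 1 and len(allowed_cpus) ranks")
    base, extra = divmod(len(cpus), n_ranks)
    plan, start = [], 0
    for rank in range(n_ranks):
        # The first `extra` ranks get one additional CPU each.
        size = base + (1 if rank < extra else 0)
        plan.append(cpus[start:start + size])
        start += size
    return plan


if __name__ == "__main__":
    # The loader process computes the plan once for the whole node.
    plan = plan_pinning(discover_allowed_cpus(), n_ranks=1)
    print(plan)
```

In a real job the loader would send plan[rank] to each local rank (for
example via MPI or shared memory), and each rank would then pin itself,
e.g. with os.sched_setaffinity(0, plan[rank]) on Linux. Because the CPU
list is read at run time from the caller's own context, it follows the
cpuset the job was started in, which a static XML export cannot do.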
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin