Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] Some practical hwloc API feedback
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-09-22 16:42:29

On Sep 22, 2011, at 2:25 PM, Brice Goglin wrote:

> On 22/09/2011 21:36, Jeff Squyres wrote:
>> 1. The depth-specific accessors are Bad. Given the warning language in the docs paired with the practical realities that some people actually do mix and match CPUs in a single server (especially when testing new chips), the depth-based accessors *can/will* fail. Meaning: you have to write application code that can handle the non-uniform depth cases, making the depth-based accessors essentially useless.
> I don't see any real problem with having depth accessors and mixed types
> of CPUs in a server. You can have different levels of caches in
> different CPUs, but you still have a uniform depth/level for important
> things like PUs, Core, Socket.

I guess I didn't get that from your documentation. Since caches sit between socket and core, they appear to affect the depth of the core in a given socket. Thus, if there are different numbers of caches in the different sockets on a node, then the core/pu level would change across the sockets.

Is that not true?

> The only problem so far is caches. But do you actually walk the list of
> caches?

Yes, we do.

> People would walk the list of PUs, Cores, Sockets, NUMA nodes.
> But when talking about Caches, I would rather see them ask "which cache
> do I have above these cores?".

But that isn't exactly how people use that info. Instead, they ask us to "map N processes on each L2 cache across the node", or to "bind all procs to their local L3 cache".
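
To make the "map N processes on each L2 cache" request concrete, here is a minimal sketch of one possible placement rule. It assumes ranks fill one cache completely before moving to the next; the thread does not say which order Open MPI actually uses, and `cache_for_rank` is a hypothetical name, not part of hwloc or Open MPI.

```c
#include <assert.h>

/* Hypothetical helper: returns the index of the cache that process `rank`
 * lands on when each cache takes `per_cache` consecutive ranks.
 * Returns -1 when the node has no cache left for that rank. */
static int cache_for_rank(int rank, int per_cache, int ncaches)
{
    assert(rank >= 0 && per_cache > 0 && ncaches > 0);
    int idx = rank / per_cache;   /* fill cache 0 first, then cache 1, ... */
    return idx < ncaches ? idx : -1;
}
```

The rule only needs the list of caches of one kind on the node, which is why walking that list matters for this use case.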

> And I don't see how DFS would help. Any concrete example?

As above. If I'm trying to map a process to (say) a core, then I have to search for all the cores. If the system has different numbers of caches on each socket, then the current search for a core object seems to have a problem as it is looking at a specific depth, yet the cores are at different depths on each socket. So I have to manually traverse the tree looking for core objects at any depth.
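
The scenario described above can be sketched with a toy tree. This models only the layout as I understand it (one socket with two cache levels above its cores, another with one), not necessarily how hwloc builds its tree, which is the open question here. Every name below (`toy_obj`, `toy_collect`, and so on) is made up for illustration and is not hwloc API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object types; these are NOT hwloc's hwloc_obj_type_t values. */
enum toy_type { TOY_MACHINE, TOY_SOCKET, TOY_CACHE, TOY_CORE };

#define TOY_MAX_CHILDREN 4

struct toy_obj {
    enum toy_type type;
    size_t nchildren;
    struct toy_obj *children[TOY_MAX_CHILDREN];
};

/* Depth-first walk: records every object of `type`, whatever its depth.
 * Pass out == NULL to count only. Returns the running total of matches,
 * which may exceed `max` (only the first `max` are stored). */
static size_t toy_collect(struct toy_obj *obj, enum toy_type type,
                          struct toy_obj **out, size_t max, size_t found)
{
    assert(obj != NULL);
    if (obj->type == type) {
        if (out != NULL && found < max)
            out[found] = obj;
        found++;
    }
    for (size_t i = 0; i < obj->nchildren; i++)
        found = toy_collect(obj->children[i], type, out, max, found);
    return found;
}

/* Counts objects of `type` sitting exactly at `depth` (the root is depth 0).
 * This mimics a lookup that only inspects a single level. */
static size_t toy_count_at_depth(const struct toy_obj *obj, enum toy_type type,
                                 unsigned depth)
{
    assert(obj != NULL);
    if (depth == 0)
        return obj->type == type ? 1 : 0;
    size_t n = 0;
    for (size_t i = 0; i < obj->nchildren; i++)
        n += toy_count_at_depth(obj->children[i], type, depth - 1);
    return n;
}

/* Example node: socket 0 has two cache levels, socket 1 has one. */
static struct toy_obj core0 = { TOY_CORE, 0, { NULL } };
static struct toy_obj core1 = { TOY_CORE, 0, { NULL } };
static struct toy_obj core2 = { TOY_CORE, 0, { NULL } };
static struct toy_obj core3 = { TOY_CORE, 0, { NULL } };
static struct toy_obj s0_l2 = { TOY_CACHE, 2, { &core0, &core1 } };
static struct toy_obj s0_l3 = { TOY_CACHE, 1, { &s0_l2 } };
static struct toy_obj s1_l2 = { TOY_CACHE, 2, { &core2, &core3 } };
static struct toy_obj sock0 = { TOY_SOCKET, 1, { &s0_l3 } };
static struct toy_obj sock1 = { TOY_SOCKET, 1, { &s1_l2 } };
static struct toy_obj machine = { TOY_MACHINE, 2, { &sock0, &sock1 } };

static struct toy_obj *toy_example_root(void) { return &machine; }
```

In this toy layout a single-depth lookup sees only two of the four cores at depth 3 and the other two at depth 4, while the depth-first walk finds all four.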

Perhaps my understanding of your tree topology is wrong, though...

>> But we're using the XML export in OMPI to send the topology of compute nodes up to the scheduler, where decisions are made about how to lay out processes on the back-end compute nodes, what the binding width will be, etc. This front-end scheduler needs to know whether the back-end node is capable of supporting binding, for example.
>> We manually added this information into the message that we send up to the scheduler, but it would be much nicer if the XML export/import just handled that automatically.
> I guess we could add some "support" attributes to the XML.
> Does your scheduler actually need to know if binding is supported? What
> does it do if not supported? Can't it just try to bind and get an error if
> not supported?

When dealing with large scale systems, it is much faster and easier to check these things -before- launching the job. Remember, on these systems, it can take minutes to launch a full-scale job! Nobody wants to sit there for that much time, only to find that the system doesn't support the requested operation.
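
As a rough sketch of that pre-launch check: assume the scheduler holds one record per node with a flag saying whether binding works there. The `node_caps` struct and `first_node_without_binding` are hypothetical names for illustration; how the flag gets to the scheduler (our extra message field today, or an XML attribute later) is outside this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node record held by the front-end scheduler. */
struct node_caps {
    const char *name;
    int can_bind;   /* nonzero if the node reported that binding works */
};

/* Returns the index of the first node that cannot bind, or -1 if every
 * node can. Intended to run before any process is launched. */
static int first_node_without_binding(const struct node_caps *nodes, size_t n)
{
    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!nodes[i].can_bind)
            return (int)i;
    return -1;
}
```

A non-negative result lets the scheduler reject or adjust the request immediately instead of discovering the problem minutes into the launch.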

> Brice