Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] carto vs. hwloc
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-15 14:20:49


On Dec 15, 2009, at 7:41 AM, Terry Dontje wrote:

> Kenneth Lloyd wrote:
>> My 2 cents: Carto is a weighted graph structure that describes the topology
>> of the compute cluster, not just the locations of nodes. Many view topologies
>> (trees, meshes, tori) as static - but I've found this an unnecessary and
>> undesirable constraint.
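
[Editor's illustration] To make the "weighted graph" idea concrete, here is a minimal C sketch. It is not carto's actual data structure or API: the type `topo_graph_t` and the functions `topo_init`, `topo_connect`, and `topo_distance` are hypothetical names invented for this example, and the adjacency-matrix layout and the use of Dijkstra's algorithm are assumptions. Because weights can be overwritten at any time, the same structure also covers the dynamic case raised above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOPO_MAX_NODES 8
#define TOPO_NO_EDGE   0u   /* 0 means "no direct link" in this sketch */

/* Hypothetical carto-like map: nodes are devices, edges carry a cost. */
typedef struct {
    size_t   num_nodes;
    unsigned weight[TOPO_MAX_NODES][TOPO_MAX_NODES];
} topo_graph_t;

/* Start with num_nodes devices and no links between them. */
void topo_init(topo_graph_t *g, size_t num_nodes)
{
    assert(g != NULL && num_nodes <= TOPO_MAX_NODES);
    g->num_nodes = num_nodes;
    for (size_t i = 0; i < TOPO_MAX_NODES; i++)
        for (size_t j = 0; j < TOPO_MAX_NODES; j++)
            g->weight[i][j] = TOPO_NO_EDGE;
}

/* Add or re-weight an undirected link; calling it again models a
 * topology that changes while the job is running. */
void topo_connect(topo_graph_t *g, size_t a, size_t b, unsigned w)
{
    assert(a < g->num_nodes && b < g->num_nodes && w > 0);
    g->weight[a][b] = w;
    g->weight[b][a] = w;
}

/* Cheapest path cost from src to dst (simple O(n^2) Dijkstra).
 * Returns UINT_MAX if dst is unreachable. Assumes small weights,
 * so the sums below cannot overflow. */
unsigned topo_distance(const topo_graph_t *g, size_t src, size_t dst)
{
    unsigned dist[TOPO_MAX_NODES];
    int done[TOPO_MAX_NODES] = {0};

    assert(src < g->num_nodes && dst < g->num_nodes);
    for (size_t i = 0; i < g->num_nodes; i++)
        dist[i] = UINT_MAX;
    dist[src] = 0;

    for (size_t round = 0; round < g->num_nodes; round++) {
        /* Pick the closest node that has not been finalized yet. */
        size_t best = g->num_nodes;
        for (size_t i = 0; i < g->num_nodes; i++)
            if (!done[i] && dist[i] != UINT_MAX &&
                (best == g->num_nodes || dist[i] < dist[best]))
                best = i;
        if (best == g->num_nodes)
            break;              /* everything left is unreachable */
        done[best] = 1;

        /* Relax every link leaving that node. */
        for (size_t i = 0; i < g->num_nodes; i++) {
            unsigned w = g->weight[best][i];
            if (w != TOPO_NO_EDGE && dist[best] + w < dist[i])
                dist[i] = dist[best] + w;
        }
    }
    return dist[dst];
}
```

A caller would build the graph once with `topo_init` and `topo_connect`, query `topo_distance` to pick the "closest" device, and call `topo_connect` again whenever a link's cost changes.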
>>
>> The compute fabric may better be left open to dynamic configuration,
>> dependent upon the partitioning of jobs, tasks and data to be run.
>>
>> How do others see this?
>>
>>
> At the network level, and actually even at the level of a node's resources, I think a case can be made for a dynamically changing topology as you mention above. However, is MPI the right level to compensate for interfaces coming and going?

It probably should be done at a lower level, but it raises a different question. For example, I've created the capability in the new cluster manager to detect interfaces that are lost, ride through the problem by moving affected procs to other nodes (reconnecting ORTE-level comm), and move procs back if/when nodes reappear. So someone can remove a node "on-the-fly" and replace that hardware with another node without having to stop and restart the job, etc. A lot of that infrastructure is now down inside ORTE, though a few key pieces remain in the ORCM code base (and most likely will stay there).

Works great - unless it is an MPI job. If we can figure out a way for the MPI procs to (a) be properly restarted on the "new" node, and (b) update the BTL connection info on the other MPI procs in the job, then we would be good to go...

Trivial problem, I am sure :-)

> It would be nice/cool if there was an APM-like feature, available at a network API level, that spanned HCAs and not just ports on the same HCA. I know why this is currently done the way it is for IB, but it always struck me that you'd want to handle interface/path changes below MPI. That way more than just MPI codes could reap the benefits.
> At a node level, the whole issue of a process's locality in relation to its memory or to other processes seems to cry out to be more of an OS type of job than an MPI one. The reasons: first, you could end up with quite a complex layout for a job, and second, things become really complicated if you want to take other MPI jobs into account.
>
> The above being said, I don't hold too much hope that things below MPI will actually take on these tasks, even though it seems like a logical level for these things to occur IMO.
>
> Anyways, I think keeping dynamic changes in mind is well worth it, but starting from a static position and moving there from it seems to make a lot of sense.
>
> --td
>> Ken Lloyd
>>
>>
>>> -----Original Message-----
>>> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On Behalf Of Jeff Squyres
>>> Sent: Monday, December 14, 2009 6:47 PM
>>> To: Open MPI Developers List
>>> Subject: Re: [OMPI devel] carto vs. hwloc
>>>
>>> I had a nice chat with Ralph this afternoon about this topic.
>>>
>>> He pointed out a few things to me:
>>>
>>> - I had forgotten (ahem) that carto has weights associated with each of its edges (and that's kind of a defining feature). hwloc, at present, does not. So perhaps hwloc would not initially replace carto -- maybe in some future hwloc version.
>>>
>>> - He also pointed out that not only paffinity, but also sysinfo, could be replaced if hwloc comes in.
>>>
>>> He also made a good point that hwloc is only "sorta" extensible right now -- meaning that, sure, you can add support for new OS's and platforms, but not in as easy/clean a way as we have in Open MPI. Specifically, adding new support right now means editing much of the current hwloc code: configure, adding #if's to the top-level tools and library core, etc. It's not nearly as clean as just adding a new plugin that is totally independent of the rest of the code base. He thought it would be [greatly] beneficial if hwloc uses the same plugin system as Open MPI before bringing it in. Indeed, Open MPI may wish to extend hwloc in ways that the main hwloc project is not interested in extending (e.g., supporting some of Cisco's custom hardware). Fair point.
>>>
>>> Additionally, the topic of plugins came up within the context of heterogeneity: have code to get the topology of the machine (RAM + processors), but have separate code to mix in accelerators/co-processors and other entities in the box. One could easily imagine plugins for each different type of entity that you would want to detect within a server.
>>>
>>> To some extent, the hwloc crew has already been discussing these issues -- we can probably work elements of much of it into what we're doing. For example, Brice and Samuel are working on adding PCI device support to hwloc (although I haven't been following the details of what they're doing). We've also talked about adding hwloc functions for editing the map that comes back. For example, hwloc could be used as the cornerstone for a new OPAL framework base, and new plugins in this base can use functions to add more information to the initial map that is reported back by the hwloc core. [shrug] Need to think about that more.
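
[Editor's illustration] The plugin idea discussed above (a core that reports the basic map, plus independent plugins that each add one kind of entity) could be sketched in C roughly as follows. This is only a sketch under stated assumptions: it is neither hwloc's nor OPAL's real interface. Every name here (`hw_map_t`, `map_plugin_t`, `map_add`, `map_build`, `map_count_kind`, and the two `*_detect` functions) is hypothetical, and the "detected" hardware is hard-coded stand-in data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAP_MAX_ENTRIES 16

/* One detected entity, e.g. kind "processor", name "cpu0". */
typedef struct {
    const char *kind;
    const char *name;
} map_entry_t;

/* The map that the core and the plugins fill in together. */
typedef struct {
    size_t      count;
    map_entry_t entries[MAP_MAX_ENTRIES];
} hw_map_t;

/* Editing function that plugins use to add information to the map.
 * Returns 0 on success, -1 if the map is full. */
int map_add(hw_map_t *map, const char *kind, const char *name)
{
    assert(map != NULL && kind != NULL && name != NULL);
    if (map->count >= MAP_MAX_ENTRIES)
        return -1;
    map->entries[map->count].kind = kind;
    map->entries[map->count].name = name;
    map->count++;
    return 0;
}

/* A plugin is just a name plus a detection callback. */
typedef struct {
    const char *plugin_name;
    int (*detect)(hw_map_t *map);
} map_plugin_t;

/* Stand-in for the core: processors and memory (hard-coded here). */
static int core_detect(hw_map_t *map)
{
    if (map_add(map, "processor", "cpu0") != 0)
        return -1;
    return map_add(map, "memory", "ram0");
}

/* Stand-in for a separate accelerator plugin. */
static int accel_detect(hw_map_t *map)
{
    return map_add(map, "accelerator", "accel0");
}

/* Adding support for a new entity type means adding one row here,
 * without touching the other detection code. */
static const map_plugin_t plugins[] = {
    { "core",  core_detect  },
    { "accel", accel_detect },
};

/* Run every plugin in order; stop at the first failure. */
int map_build(hw_map_t *map)
{
    assert(map != NULL);
    map->count = 0;
    for (size_t i = 0; i < sizeof(plugins) / sizeof(plugins[0]); i++)
        if (plugins[i].detect(map) != 0)
            return -1;
    return 0;
}

/* Count the entries of one kind in the finished map. */
size_t map_count_kind(const hw_map_t *map, const char *kind)
{
    size_t n = 0;
    for (size_t i = 0; i < map->count; i++)
        if (strcmp(map->entries[i].kind, kind) == 0)
            n++;
    return n;
}
```

A real framework would load plugins at run time instead of using a static table; the static table is used here only to keep the example self-contained.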
>>>
>>> This is all excellent feedback (I need to take it back to the hwloc crew); please let me know what else you think about these ideas tomorrow on the call.
>>>
>>>
>>>
>>> On Dec 14, 2009, at 4:13 PM, Jeff Squyres wrote:
>>>
>>>
>>>> Question for everyone (possibly a topic for tomorrow's call...):
>>>>
>>>> hwloc is evolving into a fairly nice package. It's not ready for inclusion into Open MPI yet, but it's getting there. I predict it will come in somewhere early in the 1.5 series (potentially not 1.5.0, though). hwloc will provide two things:
>>>
>>>> 1. A listing of all processors and memory, to include caches (and cache sizes!) laid out in a map, so you can see what processors share what memory (e.g., caches). Open MPI currently does not have this capability. Additionally, hwloc is currently growing support to include PCI devices in the map; that may make it into hwloc v1.0 or not.
>>>
>>>> 2. Cross-platform / OS support. hwloc currently supports a nice variety of OSs and hardware platforms.
>>>
>>>> Given that hwloc is already cross-platform, do we really need the carto framework? I.e., do we really need multiple carto plugins? More specifically: should we just use hwloc directly -- with no framework?
>>>> Random points:
>>>>
>>>> - I'm about halfway finished with "embedding" code for hwloc like PLPA has, so, for example, all of hwloc's symbols can be prepended with opal_ or orte_ or whatever. Hence, embedding hwloc in OMPI would be "safe".
>>>
>>>> - If we keep the carto framework, then we'll have to translate from hwloc's map to carto's map; there may be subtleties involved in the translation.
>>>> - I guarantee that [much] more thought has been put into the hwloc map data structure design than carto's. :-) Indeed, to make all of hwloc's data available to OMPI, carto's map data structures may end up evolving to look pretty much exactly like hwloc's. In which case -- what's the point of carto?
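
[Editor's illustration] The first random point above mentions prepending a prefix to every symbol so that an embedded copy of a library cannot clash with another copy linked into the same process. One common way to do this in C is with preprocessor renaming, sketched below. This is an assumption about the general technique, not a description of hwloc's or PLPA's actual embedding code; the macro names (`EMBED_PREFIX`, `EMBED_NAME`) and the toy functions `topo_init` and `topo_num_cores` are hypothetical.

```c
#include <assert.h>

/* The embedding project chooses the prefix at build time. */
#define EMBED_PREFIX opal_topo_

/* Two-level paste so that EMBED_PREFIX is expanded before ## is applied. */
#define EMBED_CAT2(a, b) a##b
#define EMBED_CAT(a, b)  EMBED_CAT2(a, b)
#define EMBED_NAME(stem) EMBED_CAT(EMBED_PREFIX, stem)

/* One rename line per public symbol of the embedded library. */
#define topo_init      EMBED_NAME(init)
#define topo_num_cores EMBED_NAME(num_cores)

/* The library source is written with its natural names; after
 * preprocessing, the linker only ever sees opal_topo_init and
 * opal_topo_num_cores. */
static int topo_ready = 0;

void topo_init(void)
{
    topo_ready = 1;
}

int topo_num_cores(void)
{
    assert(topo_ready);   /* topo_init() must be called first */
    return 4;             /* hard-coded stand-in value */
}
```

Code compiled with the rename header can keep calling `topo_init()`, while the object file exports only the prefixed names, so a second, unprefixed copy of the same library can be linked in without conflicts.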
>>>
>>>> Thoughts?
>>>>
>>>> hwloc also provides processor binding functions, so it might also make the paffinity framework moot...
>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>>
>>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>