I tried a configuration on a small test cluster similar to your Approach 1,
with interesting (promising) results. While the topology is deterministic, I
found that the actual performance is under-determined in practice,
depending on the symmetry and partitioning of the tasks and the data.
Your second approach is understandable as a generalized (sub-optimal)
solution, but I ended up abandoning hard-coded/hard-wired topologies in
favor of a more dynamic approach, in order to improve the efficiency and
effectiveness of our compute fabric. That approach depends on several
contextual factors, such as concurrent schedules, priorities, existing
configurations, and the specific task and data partitions. I'm afraid
I cannot be more specific at this time.
The resulting topologies are myriad: some patterns are already
identified, some are (as yet) almost indescribable. The determining
factor is usually the structure of the existing codes and the particular
data reduction/partitioning of the job. I found the hardware and
topologies closely coupled with the existing software and data, which
provide the constraints.
> -----Original Message-----
> From: devel-bounces_at_[hidden]
> [mailto:devel-bounces_at_[hidden]] On Behalf Of Luigi Scorzato
> Sent: Friday, October 30, 2009 2:47 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RFC: revamp topo framework
> I am very interested in this, but let me explain my present
> situation and goals in more detail.
> I am working in a group that is testing a system under
> development which is connected with both:
> - an ordinary all-to-all standard interface (where Open MPI
> is already available), but with limited performance and scalability.
> - a custom 3D torus network, with no MPI available and custom
> low-level communication primitives (under development), from
> which we expect higher performance and scalability.
> I have two approaches in mind:
> 1st approach.
> Use the standard network interface to set up MPI. However,
> through a precompilation step, redefine a few MPI_ functions
> (MPI_Recv() and others) such that they call the torus
> primitives if the communication is between nearest
> neighbors, and fall back to standard MPI through the
> standard interface if not. This can only work if I can choose
> the MPI ranks of my system in such a way that
> MPI_Cart_create() will generate coordinates consistent with
> the physical topology.
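For what it's worth, the dispatch logic of this interception scheme can be
sketched in plain C. The dimensions below are illustrative, and the
torus_send()/PMPI_Send() dispatch is only indicated in a comment; the helper
names are hypothetical, not any real API. The one piece that is standard MPI
behavior is the mapping itself: the standard specifies row-major ordering for
cartesian coordinates, and with reorder = 0 the ranks are unchanged by
MPI_Cart_create().

```c
#include <stdlib.h>

/* Illustrative 3D torus dimensions -- in practice these would come
   from MPI_Dims_create(). */
static const int dims[3] = { 4, 4, 4 };

/* Row-major rank -> (x,y,z) mapping, i.e. the deterministic mapping
   MPI uses for cartesian communicators created with reorder = 0. */
static void rank_to_coords(int rank, int c[3])
{
    c[0] = rank / (dims[1] * dims[2]);
    c[1] = (rank / dims[2]) % dims[1];
    c[2] = rank % dims[2];
}

/* Hop count along one dimension with periodic (torus) wrap-around. */
static int torus_dist(int a, int b, int dim)
{
    int d = abs(a - b);
    return d < dim - d ? d : dim - d;
}

/* Two ranks are nearest neighbors iff they are exactly one hop apart. */
static int is_nearest_neighbor(int src, int dst)
{
    int s[3], d[3], hops = 0, i;
    rank_to_coords(src, s);
    rank_to_coords(dst, d);
    for (i = 0; i < 3; i++)
        hops += torus_dist(s[i], d[i], dims[i]);
    return hops == 1;
}

/* The redefined MPI_Send() would then dispatch roughly as:
 *   if (is_nearest_neighbor(me, dst)) torus_send(...);  // custom primitive
 *   else                              PMPI_Send(...);   // standard fallback
 */
```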
> ***There must be a place somewhere in the Open MPI code
> where the cartesian coordinates are chosen, presumably as a
> deterministic function of the MPI ranks and the dimensions
> (as given by MPI_Dims_create()). I expected it to be in
> MPI_Cart_create(), but I could not find it. Can anyone
> help?*** This approach has obvious portability
> limitations, besides requiring the availability of a fallback
> network, but it gives me full control over what I need to do,
> which is essential since my primary goal is to get a few
> important codes working on the new system asap.
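On the ***question*** above: with reorder = 0 the mapping is fixed by the
standard (row-major order over the coordinates), so the inverse mapping, and
the wrap-around neighbor ranks that MPI_Cart_shift() would report on a
periodic grid, can be precomputed when assigning ranks to physical nodes.
A minimal sketch with illustrative dimensions; torus_neighbor() is a
hypothetical helper here, not MPI API:

```c
/* Inverse of the row-major mapping: (x,y,z) -> rank. */
static int coords_to_rank(const int c[3], const int dims[3])
{
    return (c[0] * dims[1] + c[1]) * dims[2] + c[2];
}

/* Rank one hop away along dimension `dim` in direction `disp` (+1 or
   -1), with periodic wrap-around -- the rank MPI_Cart_shift() would
   report on a fully periodic cartesian communicator. */
static int torus_neighbor(int rank, int dim, int disp, const int dims[3])
{
    int c[3];
    c[0] = rank / (dims[1] * dims[2]);
    c[1] = (rank / dims[2]) % dims[1];
    c[2] = rank % dims[2];
    c[dim] = (c[dim] + disp + dims[dim]) % dims[dim];
    return coords_to_rank(c, dims);
}
```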
> 2nd approach.
> Develop a new "torus" topo component, as explained by Jeff.
> This is certainly the *right* solution, but there are two problems:
> - because of my poor familiarity with the open-mpi source
> code, I am not able to estimate how long it will take me.
> - in a first phase, the torus primitives will not support
> all-to-all communication but only nearest-neighbor exchanges.
> Hence, full portability is excluded anyway and/or a fallback
> network is still needed. In other words, the topo component
> should be able to deal with two networks, and I have no idea
> how much this will complicate things.
> I necessarily have to push the 1st approach for the moment,
> but I am very much interested in studying the 2nd, and if I
> see that it is realistic (given the limitations above) and
> safe, I may switch to it completely.
> thanks for your feedback and best regards, Luigi