The way Open MPI currently selects the network used between
processes matches the first approach you proposed very well. Because
we support multiple networks simultaneously, a BTL (the low-level
network driver) can service only a subset of peers. All other
communication is automatically redirected through another BTL (which
has to be available). In the past there were some attempts to route
messages, but this code is not in the trunk.
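To make the per-peer selection concrete for a torus, here is a minimal
sketch in plain C. It is not Open MPI code: rank_to_coords(),
torus_neighbors(), select_transport() and the transport_t enum are
hypothetical names used only for illustration. It assumes that ranks map
to coordinates in row-major order (last dimension varying fastest, which
is the numbering the MPI standard prescribes for Cartesian topologies),
that rank reordering is disabled, and that every torus dimension is
periodic. A peer exactly one hop away is served by the torus; every other
peer falls through to the all-to-all network.

```c
#include <stdlib.h>

#define NDIMS 3

/* Hypothetical transport tags; these are not Open MPI identifiers. */
typedef enum { TRANSPORT_TORUS, TRANSPORT_FALLBACK } transport_t;

/* Row-major rank -> coordinates: the last dimension varies fastest. */
static void rank_to_coords(int rank, const int dims[NDIMS], int coords[NDIMS])
{
    for (int d = NDIMS - 1; d >= 0; --d) {
        coords[d] = rank % dims[d];
        rank /= dims[d];
    }
}

/* Returns 1 if ranks a and b are exactly one hop apart on a periodic
 * torus (assumption: all dimensions wrap around), 0 otherwise. */
static int torus_neighbors(int a, int b, const int dims[NDIMS])
{
    int ca[NDIMS], cb[NDIMS];
    int hops = 0;

    rank_to_coords(a, dims, ca);
    rank_to_coords(b, dims, cb);

    for (int d = 0; d < NDIMS; ++d) {
        int diff = abs(ca[d] - cb[d]);
        if (dims[d] - diff < diff) {
            diff = dims[d] - diff;   /* shorter path goes through the wrap */
        }
        hops += diff;
    }
    return hops == 1;
}

/* Per-peer choice: use the torus for nearest neighbors and the
 * all-to-all network for everyone else. */
static transport_t select_transport(int self, int peer, const int dims[NDIMS])
{
    return torus_neighbors(self, peer, dims) ? TRANSPORT_TORUS
                                             : TRANSPORT_FALLBACK;
}
```

With dims = {4, 4, 4}, rank 0 would reach ranks 1, 3, 4, 12, 16 and 48
over the torus, while a peer such as rank 5 (two hops away) would go
through the fallback network.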
On Oct 30, 2009, at 04:47, Luigi Scorzato wrote:
> I am very interested in this, but let me explain in more detail my
> present situation and goals.
> I am working in a group that is testing a system under development
> which is connected with both:
> - an ordinary all-to-all standard interface (where open-mpi is
> already available) but with limited performance and scalability.
> - a custom 3D torus network, with no MPI available and custom low-level
> communication primitives (under development), from which we expect
> higher performance and scalability.
> I have two approaches in mind:
> 1st approach.
> Use the standard network interface to set up MPI. However, through a
> precompilation step, redefine a few MPI_ functions (MPI_Send(),
> MPI_Recv() and others) such that they call the torus primitives if
> the communication is between nearest neighbors, and fall back to
> standard MPI through the standard interface if not. This can only
> work if I can choose the mpi-ranks of my system in a way that
> MPI_Cart_create() will generate coordinates consistent with the
> physical topology.
> ***There must be a place - somewhere in the open-mpi code - where
> the cartesian coordinates are chosen, presumably as a deterministic
> function of the mpi-ranks and the dimensions (as given by
> MPI_Dims_create). I expected it to be in MPI_Cart_create(). But I
> could not find it. Can anyone help?***
> This approach has obvious limitations of portability, besides
> requiring the availability of a fallback network, but it gives me
> full control of what I need to do, which is essential since my
> primary goal is to get a few important codes working in the new
> system asap.
> 2nd approach.
> Develop a new "torus" topo component, as explained by Jeff. This is
> certainly the *right* solution, but there are two problems:
> - because of my poor familiarity with the open-mpi source code, I am
> not able to estimate how long it will take me.
> - in a first phase, the torus primitives will not support all-to-all
> communication but only nearest-neighbor communication. Hence, full
> portability is excluded anyway and/or a fallback network is still
> needed. In other words, the topo component should be able to deal
> with two networks, and I have no idea of how much this will
> complicate things.
> I necessarily have to push the 1st approach for the moment, but I
> am very much interested in studying the 2nd, and if I see that it is
> realistic (given the limitations above) and safe, I may turn to it.
> Thanks for your feedback and best regards, Luigi