Just an FYI - I asked a similar question recently and got the following answer from Rolf:

In my case, it was specific to openib only and it required you to be running with two or more IB rails.
Then, if one of them failed, we just shut it down, and continued with the working ones.
You could only get use of the failing rail if it was fixed and a new job was started.

To get this to work, I created a new PML called bfo.  I also had to make some changes in the openib BTL.
By default, none of the code is configured in.  There is a README in the PML bfo directory that 
actually does quite a good job explaining what I did.

The bfo module is included in the 1.6 series, and in the upcoming 1.7 series. Can't say anything as to its state of repair.
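
For anyone who wants the gist of the behaviour Rolf describes, here is a rough sketch in Python. It is not the bfo PML or the openib BTL code; the `RailSet` class, the rail names, and the use of `OSError` to signal a rail failure are all assumptions made up for illustration. It only shows the policy: a failed rail is shut down for the rest of the job and traffic continues on the remaining ones.

```python
class RailSet:
    """Toy model of rail failover (hypothetical, not Open MPI code)."""

    def __init__(self, rails):
        # rails: mapping of rail name -> callable that sends a payload.
        # Both the names and the callables are illustrative assumptions.
        self.rails = dict(rails)
        self.failed = set()

    def send(self, payload):
        """Send on the first working rail and return that rail's name."""
        for name, send_fn in self.rails.items():
            if name in self.failed:
                continue  # rail was shut down earlier in this job
            try:
                send_fn(payload)
                return name
            except OSError:
                # Shut the rail down; it stays down until a new job starts.
                self.failed.add(name)
        raise RuntimeError("no working rails left")
```

A new `RailSet` stands in for "a new job was started": only then is a repaired rail tried again.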


On Oct 25, 2012, at 10:41 AM, George Bosilca <bosilca@icl.utk.edu> wrote:


On Oct 25, 2012, at 17:54 , Lirong Jian <lirong.misc@gmail.com> wrote:

Hi folks,

Sorry to bother you guys, but I have some questions about Open MPI and really want your help.

There are some papers (e.g., [1, 2, 3], although they are somewhat dated) mentioning that Open MPI supports NIC failover and message striping over multiple NICs. However, when I read the source code of openmpi-1.6.2, I couldn't find any component named DR or TEG (which are mentioned in those papers and are supposed to support NIC failover and message striping). So my question is:

Does the 1.6.2 release of Open MPI support these two features? If so, which part of the code implements them?

Lirong,

As you noticed, the papers are quite old and dusty.

Due to a lack of interest from the community, the DR PML has been retired from our stable releases. In other words, no stable Open MPI version supports network failover. However, the code is still available in the trunk, but there is no guarantee it still does what it was designed for.

TEG has been replaced with OB1, which is our current network management layer. It does striping over multiple NICs (identical or not) by default.

  george.
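
For readers new to the idea, striping a message over several NICs can be sketched roughly as below. This is only an illustration in Python, not OB1's actual scheduling code; the function name `stripe_message` and the bandwidth-proportional split are assumptions made for the example.

```python
def stripe_message(length, bandwidths):
    """Split `length` bytes across links in proportion to their bandwidth.

    Returns one (offset, size) pair per link. Hypothetical helper, not
    part of Open MPI.
    """
    if not bandwidths or min(bandwidths) <= 0:
        raise ValueError("need at least one link with positive bandwidth")
    total = sum(bandwidths)
    # Integer share for each link, rounded down.
    sizes = [length * bw // total for bw in bandwidths]
    # Hand the rounding remainder to the fastest link.
    sizes[bandwidths.index(max(bandwidths))] += length - sum(sizes)
    fragments = []
    offset = 0
    for size in sizes:
        fragments.append((offset, size))
        offset += size
    return fragments
```

For example, `stripe_message(1000, [10, 40])` gives `[(0, 200), (200, 800)]`: the link with four times the bandwidth carries four times the data, which is one simple way to handle NICs that are not identical.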


Many thanks in advance.

P.S. I am a newbie in this domain. My questions may be simple, even naive, but your help would be highly appreciated.

Best,
Lirong


[1] Network Fault Tolerance in Open MPI.
[2] Open MPI: A High Performance, Flexible Implementation of MPI Point-to-Point Communications.
[3] TEG: A High-Performance, Scalable, Multi-network, Point-to-Point, Communications Methodology.
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
