
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-09-22 08:00:13

Sorry for the delay in replying -- I was in Europe for the past two weeks; travel always makes me waaaay behind on my INBOX...

On Sep 14, 2010, at 9:56 PM, 张晶 wrote:

> I tried to add a scheduling algorithm to a PML component (ob1, etc.). Unfortunately, I could only find a paper named "Open MPI: A Flexible High Performance MPI" and some comments in the source files. From them, I know that ob1 implements round-robin and weighted distribution algorithms. But after tracing MPI_Send(), I can't figure out
> where these are implemented, let alone how to add a new scheduling algorithm.
> I have two questions:
> 1. Where is the scheduling algorithm located?

It's complicated -- I'd say that the PML is probably among the most complicated sections of Open MPI because it is the main "engine" that enforces the MPI point-to-point semantics. The algorithm is fairly well distributed throughout the PML source code. :-\

> 2. There are five components in the PML framework: cm, crcpw, csum, ob1, and v. What is the function of each of these components?

cm: this component drives the MTL point-to-point components. It is mainly a thin wrapper for network transports that provide their own MPI-like matching semantics. Hence, most of the MPI semantics are effectively done in the lower layer (i.e., in the MTL components and their dependent libraries). You probably won't be able to do much here, because such transports (MX, Portals, etc.) do most of their semantics in the network layer -- not in Open MPI. If you have a matching network layer, this is the PML that you probably use (MX, Portals, PSM).

crcpw: this is a fork of the ob1 PML; it adds some failover semantics.

csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so you can tell if the underlying transport had an error).

v: this PML uses logging and replay to effect some level of fault tolerance. It's a distant fork of the ob1 PML, but has quite a few significant differences.

ob1: this is the "main" PML that most users use (TCP, shared memory, OpenFabrics, etc.). It gangs together one or more BTLs to send/receive messages across individual network transports. Hence, it supports true multi-device/multi-rail algorithms. The BML (BTL multiplexing layer) is a thin management layer that marshals all the BTLs in the process together -- it's mainly array handling, etc. The ob1 PML is the one that decides multi-rail/device splitting, etc. The INRIA folks just published a paper last week at EuroMPI about adjusting the ob1 scheduling algorithm to also take NUMA/NUNA/NUIOA effects into account, not just raw bandwidth calculations.
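
To make the multi-rail idea a bit more concrete, here is a minimal standalone sketch of the two policies mentioned in the question: round-robin selection and bandwidth-weighted splitting. This is NOT the ob1 code and it does not use any Open MPI API; rail_t, weighted_split() and round_robin_next() are hypothetical names invented for illustration, and the "bandwidth" field is just an assumed relative weight per transport.

```c
#include <stddef.h>

/* Hypothetical stand-in for one network transport ("rail").
 * This is an illustration only, not an Open MPI data structure. */
typedef struct {
    const char *name;   /* label, unused by the algorithms */
    double bandwidth;   /* assumed relative weight, arbitrary units */
} rail_t;

/* Split `total` bytes across `n` rails in proportion to their weights.
 * Per-rail byte counts are written to out[0..n-1]; the last rail takes
 * whatever is left after rounding down, so the counts sum to `total`.
 * Returns 0 on success, -1 on invalid input. */
int weighted_split(const rail_t *rails, size_t n, size_t total, size_t *out)
{
    double sum = 0.0;
    size_t assigned = 0;

    if (n == 0) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        if (rails[i].bandwidth < 0.0) {
            return -1;
        }
        sum += rails[i].bandwidth;
    }
    if (sum <= 0.0) {
        return -1;
    }

    for (size_t i = 0; i + 1 < n; i++) {
        size_t share = (size_t)((double)total * (rails[i].bandwidth / sum));
        /* Guard against floating-point rounding handing out too much. */
        if (share > total - assigned) {
            share = total - assigned;
        }
        out[i] = share;
        assigned += share;
    }
    out[n - 1] = total - assigned;
    return 0;
}

/* Round-robin: return the rail index to use next and advance the cursor.
 * The caller must pass n > 0. */
size_t round_robin_next(size_t *cursor, size_t n)
{
    size_t pick = *cursor % n;
    *cursor = (pick + 1) % n;
    return pick;
}
```

With two rails weighted 3.0 and 1.0, weighted_split() on a 1000-byte message assigns 750 and 250 bytes; round_robin_next() with n = 3 cycles through 0, 1, 2, 0. The real scheduling in ob1 is spread across the PML and involves protocol decisions that this sketch ignores.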

Hope this helps!

Jeff Squyres