Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] initial SCTP BTL commit comments?
From: Brad Penoff (penoff_at_[hidden])
Date: 2007-11-10 21:25:02

On Nov 10, 2007 9:42 AM, Andrew Friedley <afriedle_at_[hidden]> wrote:
> Brad Penoff wrote:
> > Any objections to us committing an SCTP BTL to ompi-trunk if it has
> > the ompi_ignore file in it first?
> I'd like to see this in the trunk, though I'd guess that others will
> want to know how you plan to support/maintain this code long-term once
> it's in. I don't think an ompi_ignore is necessary either, as long as
> your configure checks are right.

Our current research uses SCTP with MPI, and as long as that continues
we will keep contributing bug fixes and some of the enhancements, such
as the use of SCTP multistreaming in an MTL (where MPI details are more
exposed) and the use of a btl_progress function, among others.

> Do you have any publications on this work?

Certainly. Our team's webpage is here, although admittedly I should update it:

Our first marquee publication was at SC|05; that was a LAM based
implementation. Our MPICH2 ch3:sctp was released in Dec 2006 and is
currently used by at least the FreeBSD and Mac OS X stack developers.
Only in recent months have we had the time and people to devote to
Open MPI support.

Probably the most relevant paper is the Euro PVM/MPI 2007 paper where
we compared our MPICH2 ch3:sctp channel that uses SCTP's multihoming
feature coupled together with CMT (concurrent multipath transfer) to
Open MPI's middleware-level striping. CMT provides some of the same
functionality as Open MPI's striping, but in the kernel rather than in
the middleware.

> > For fault tolerance purposes, SCTP connections (termed "associations")
> > can be made aware of multiple interfaces on the endpoints by binding
> > to more than one interface (for performance, the CMT extension uses
> > this multihoming feature to stripe data). SCTP also has several
> > different APIs that it supports. Like TCP, there can be a one-to-one
> > socket per connection. Another option is that like UDP, there can be
> > a single one-to-many socket that is used for all connections. The
> > SCTP BTL has the option of using either socket style, depending on the
> > value of the btl_sctp_if_11 MCA option. When this value is 1, the
> > one-to-one socket is used and like the TCP BTL, there are as many BTL
> > component modules as the number of network cards specified with
> > if_include and friends. By default, this value is 0 which means that
> > a single one-to-many socket is used; here only one BTL module is used
> > and internally, SCTP itself handles within that one socket all the
> > network cards specified with if_include, etc.
> Sounds like a good setup. Have you done performance/resource
> utilization/scaling comparisons of the two approaches, as well as how
> they compare to the TCP BTL?
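
To make the two socket styles concrete, here is a minimal Python sketch
(the BTL itself is written in C, and this is not taken from its source).
In the SCTP sockets API the style is selected by the socket type:
SOCK_STREAM gives a one-to-one socket and SOCK_SEQPACKET gives a
one-to-many socket. The helper names style_for and open_sctp_socket are
hypothetical, and treating every value other than 1 as one-to-many is an
assumption; only btl_sctp_if_11 and its 0/1 meaning come from the
description above.

```python
import socket

# IANA protocol number for SCTP is 132; some Python builds do not export
# the constant, so fall back to the number.
IPPROTO_SCTP = getattr(socket, "IPPROTO_SCTP", 132)

# Socket type used for each SCTP socket style.
SOCKET_TYPE = {
    "one-to-one": socket.SOCK_STREAM,      # TCP-like: one socket per association
    "one-to-many": socket.SOCK_SEQPACKET,  # UDP-like: one socket for all associations
}


def style_for(if_11):
    """Map a btl_sctp_if_11 value to a socket style.

    1 selects one-to-one; anything else is treated as the default
    one-to-many (an assumption made for this sketch).
    """
    return "one-to-one" if if_11 == 1 else "one-to-many"


def open_sctp_socket(style):
    """Open an SCTP socket of the given style, or return None if the
    local kernel or platform does not provide SCTP."""
    try:
        return socket.socket(socket.AF_INET, SOCKET_TYPE[style], IPPROTO_SCTP)
    except OSError:
        return None


if __name__ == "__main__":
    for if_11 in (0, 1):
        style = style_for(if_11)
        sock = open_sctp_socket(style)
        print(if_11, style, "opened" if sock else "SCTP not available")
        if sock:
            sock.close()
```

On a host without a usable SCTP stack, socket creation fails and the
sketch simply reports that SCTP is not available.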

You must have read our minds because actually we are doing performance
and resource utilization comparisons right now, extending our Euro
PVM/MPI 2007 work for a journal. The CPU numbers are currently being
obtained and scrutinized. The OSU bandwidth tests show that the SCTP
BTL (in both its one-to-many and one-to-one modes) performs comparably
to TCP on FreeBSD. Karol may be able to comment more on this. We hope
to improve the performance with some of the future middleware
enhancements mentioned, as well as some in the protocol/kernel.

It must be said that, in general, SCTP performance is highly
stack-dependent. The FreeBSD stack is the most bug-free. The Mac OS X
stack uses the same code base, for the most part. The Linux stack
tends to be slightly less dependable than the FreeBSD one, mostly
because of the Linux stack's relative age. A frustration for users of
a new stack is that it is sometimes difficult to tell whether a hang is
the fault of the stack or of the application/middleware. Our hope is
that adding SCTP support to Open MPI will expand SCTP's user base and,
through the resulting bug reports, lead to stronger SCTP stacks on all
platforms.

> > Currently, both the one-to-one and the one-to-many make use of the
> > event library offered by Open MPI. The callback functions for the
> > one-to-many style however are quite unique as multiple endpoints may
> > be interested in the events that poll returns. Currently we use these
> > unique callback functions, but in the future the hope is to play with
> > the potential benefits of a btl_progress function, particularly for
> > the one-to-many style.
> In my experience the event callbacks have a high overhead compared to a
> progress function, so I'd say that's definitely worth checking out.

We noticed that poll is only called after a timer goes off, whereas
btl_progress would be called on each iteration of opal_progress. That
observation, along with your encouragement, makes us want to check it
out even more.
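
As a rough illustration of that difference, the toy Python model below
counts how often each approach gets a chance to check the network. It
is only a sketch: the function count_polls is hypothetical, and
modeling the timer as "fires once every timer_interval iterations of
the progress loop" is an assumption made for simplicity, not how the
Open MPI event library is implemented.

```python
def count_polls(iterations, timer_interval):
    """Count network checks under the two strategies.

    event_polls:    polling happens only when a timer fires, modeled
                    here as once every `timer_interval` loop iterations.
    progress_polls: a registered progress function runs on every
                    iteration of the progress loop.
    """
    event_polls = 0
    progress_polls = 0
    for i in range(1, iterations + 1):
        if i % timer_interval == 0:
            event_polls += 1      # timer expired, so poll is called
        progress_polls += 1       # progress function is always called
    return event_polls, progress_polls


if __name__ == "__main__":
    print(count_polls(1000, 10))  # prints (100, 1000)
```

Under this model a progress function sees incoming data sooner, at the
cost of doing work on every iteration even when nothing has arrived.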

Thanks for your comments,

> Andrew