Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] initial SCTP BTL commit comments?
From: Brad Penoff (penoff_at_[hidden])
Date: 2007-11-13 20:26:17


On Nov 13, 2007 12:41 PM, Brad Penoff <penoff_at_[hidden]> wrote:
> On Nov 12, 2007 3:26 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > I have no objections to bringing this into the trunk, but I agree that
> > an .ompi_ignore is probably a good idea at first.
>
> I'll try to cook up a commit soon then!

It's in there now!
https://svn.open-mpi.org/trac/ompi/changeset/16723

A quick sanity test shows that things are operational. For others to
use it, they'll of course have to adjust .ompi_ignore (or
.ompi_unignore).

We're playing with MTT now so I'd expect we'll have some questions on
that front in the near future.

Where is the best place to put BTL-specific documentation (for
example, some setup tips and weblinks)?

brad

>
> > One question that I'd like to have answered is how OMPI decides
> > whether to use the SCTP BTL or not. If there are SCTP stacks
> > available by default in Linux and OS X -- but their performance may be
> > sub-optimal and/or buggy, we may want to have the SCTP BTL only
> > activated if the user explicitly asks for it. Open MPI is very
> > concerned with "out of the box" behavior -- we need to ensure that
> > "mpirun a.out" will "just work" on all of our supported platforms.
>
> Just to make a few things explicit...
>
> Things would only work out of the box on FreeBSD, and there the stack
> is very good.
>
> We have less experience with the Linux stack but hope the availability
> of an SCTP BTL will help encourage its use by us and others. Currently
> the stack is a kernel module by default (loaded with "modprobe sctp"),
> but the actual SCTP sockets extension API needs to be downloaded and
> installed separately. The so-called lksctp-tools can be obtained here:
> http://sourceforge.net/project/showfiles.php?group_id=26529
>
> The OS X stack does not come by default but instead is a kernel extension:
> http://sctp.fh-muenster.de/sctp-nke.html
> I haven't yet started this testing but intend to soon. As of now
> though, the supplied configure.m4 does not try to even build the
> component on Mac OS X.
>
> So in my opinion, things in the configure scripts should be fine the
> way they are, since only the FreeBSD stack (which we have confidence in)
> will try to work out of the box; the others require the user to
> install things.
>
>
> A question I had was with respect to what to set for the default value
> of btl_sctp_exclusivity... I had wanted the exclusivity to be
> "slightly less than TCP" so it was available but not the default. In
> the code I set btl_sctp_exclusivity to this:
> MCA_BTL_EXCLUSIVITY_LOW - 1
> ...however MCA_BTL_EXCLUSIVITY_LOW is defined as 0 and ompi_info says
> that exclusivity must be >= 0... a -1 exclusivity doesn't seem to
> break anything though... If two BTLs have the same exclusivity, what
> is the tie-break? Alphabetic order?
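
The arithmetic behind the exclusivity question can be sketched in a few lines of C. This is only an illustration under stated assumptions: the value 0 for MCA_BTL_EXCLUSIVITY_LOW comes from the message above, while example_btl_t, store_exclusivity(), ranks_above(), and the choice of an unsigned 32-bit field are hypothetical and should be checked against the real BTL headers. If the field is unsigned, a default of MCA_BTL_EXCLUSIVITY_LOW - 1 does not land slightly below LOW; it wraps around to the largest representable value. The sketch says nothing about how ties are broken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value taken from the message above: MCA_BTL_EXCLUSIVITY_LOW is 0. */
#define MCA_BTL_EXCLUSIVITY_LOW 0

/* ASSUMPTION: the exclusivity field is an unsigned 32-bit integer.
 * This struct is hypothetical; verify the real declaration before
 * drawing conclusions about the SCTP BTL. */
typedef struct {
    const char *name;
    uint32_t    exclusivity;
} example_btl_t;

/* Store a signed default into the (assumed) unsigned field. */
static uint32_t store_exclusivity(int requested)
{
    return (uint32_t) requested;   /* -1 wraps to UINT32_MAX */
}

/* Nonzero if a ranks above b when comparing exclusivity alone.
 * Equal values compare as "not above"; no tie-break is modeled. */
static int ranks_above(const example_btl_t *a, const example_btl_t *b)
{
    assert(a != NULL && b != NULL);
    return a->exclusivity > b->exclusivity;
}
```

Under that assumption, a BTL initialized with store_exclusivity(MCA_BTL_EXCLUSIVITY_LOW - 1) would compare above one initialized with MCA_BTL_EXCLUSIVITY_LOW, which is the opposite of the intended "slightly less" ordering.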
>
> >
> > Will UBC setup regular MTT runs to test the SCTP stuff? :-)
> >
>
> We've only started playing with MTT so I'm sure we'll have plenty of
> questions as we begin this process!
>
>
> > More below.
> >
> >
> > On Nov 10, 2007, at 9:25 PM, Brad Penoff wrote:
> >
> > >>> Currently, both the one-to-one and the one-to-many make use of the
> > >>> event library offered by Open MPI. The callback functions for the
> > >>> one-to-many style however are quite unique as multiple endpoints may
> > >>> be interested in the events that poll returns. Currently we use
> > >>> these unique callback functions, but in the future the hope is to
> > >>> play with the potential benefits of a btl_progress function,
> > >>> particularly for the one-to-many style.
> > >>
> > >> In my experience the event callbacks have a high overhead compared
> > >> to a progress function, so I'd say that's definitely worth checking
> > >> out.
> > >
> > > We noticed that poll is only called after a timer goes off, while
> > > btl_progress would be called with each iteration of opal_progress, so
> > > noticing that along with your encouragement makes us want to check it
> > > out even more.
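
The difference in call frequency described above can be illustrated with a small simulation. This is a sketch only: call_counts_t and simulate_progress_loop() are hypothetical names, the "timer" is modeled as a fixed number of loop iterations rather than wall-clock time, and nothing here reflects the actual opal_progress implementation.

```c
#include <assert.h>

/* Hypothetical counters; none of these names come from Open MPI. */
typedef struct {
    unsigned long progress_calls;  /* progress function invocations */
    unsigned long timer_calls;     /* timer-driven callback invocations */
} call_counts_t;

/* Simulate `iterations` passes through a progress loop in which a
 * registered progress function runs on every pass, while a timer
 * callback runs only once per `timer_period` passes. */
static call_counts_t simulate_progress_loop(unsigned long iterations,
                                            unsigned long timer_period)
{
    call_counts_t counts = { 0, 0 };
    unsigned long since_timer = 0;

    assert(timer_period > 0);
    for (unsigned long i = 0; i < iterations; ++i) {
        counts.progress_calls++;            /* polled on every iteration */
        if (++since_timer == timer_period) {
            counts.timer_calls++;           /* polled only when the timer fires */
            since_timer = 0;
        }
    }
    return counts;
}
```

With 1000 iterations and a timer period of 100 iterations, the progress function is called 1000 times and the timer callback 10 times, which is the gap in polling opportunities that motivates trying btl_progress.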
> >
> >
> > Be aware that based on discussions from the Paris meeting, some
> > changes to libevent are coming (I really need to get this on a wiki
> > page or something). Here's a quick summary:
> >
> > - We're waiting for a new release of libevent (or libev -- we'll see
> > how it shakes out) that has lots of bug fixes and performance
> > improvements as compared to the version we currently have in the OMPI
> > tree. Based on some libevent mailing list traffic, this release may
> > be in Dec 2007. We'll see what happens.
> >
> > - After we update libevent, we'll be making a policy change w.r.t.
> > OMPI progress functions and timer callbacks: only software layers with
> > actual devices will be allowed to register progress functions (in
> > particular, the io and osd framework progress functions will be
> > eliminated; see below). All other progress-requiring functions will
> > have to use timers. This means that every time we call progress, we
> > *only* call the stuff that needs to be polled as frequently as
> > possible. We'll call the less-important progress stuff less
> > frequently (e.g., ORTE OOB/RML).
> >
> > - We'll be changing our use of libevent to utilize the more scalable
> > polling capabilities (such as epoll and friends). We don't use them
> > right now because on all OS's that we currently care about (Linux, OS
> > X, Solaris), mixing the scalable fd polling mechanism with pty's
> > results in Very Very Bad Things. We'll special case where pty's are
> > used and only use select/poll there, and then use epoll (etc.)
> > elsewhere.
> >
> > - We'll also be changing our use of libevent to utilize timers
> > properly.
> >
> > - ompi_request_t will be augmented to have a callback that, if
> > non-NULL, will be invoked when the request is completed. This will
> > allow removing the io and osd framework progress functions.
> >
> > - We may also add a high-performance clock framework in Open MPI -- a
> > way of accessing high-resolution timers and clocks on the host (e.g.,
> > on Intel chips, additional algorithms are necessary to normalize the
> > per-chip clocks between sockets, especially if a process bounces
> > between sockets -- unnecessary on AMD, PPC, and SPARC platforms).
> > This could improve performance and precision of the libevent timers.
> >
> > - Finally, registering progress functions will take a new parameter: a
> > file descriptor. If a file descriptor is provided and opal_progress()
> > decides that it wants to block (specific mechanism TBD, but probably
> > something similar to what other hybrid polling/blocking systems do:
> > poll for a while, and if nothing "interesting" happens, block) *and*
> > if all registered progress functions have valid fd's, then we'll block
> > until either a timer expires or something "interesting" happens.
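
The "poll for a while, then block" idea in the last item can be sketched with plain POSIX poll(). The specific mechanism is explicitly TBD in the message, so this is one possible shape rather than the planned design: poll_then_block(), spin_limit, and timeout_ms are hypothetical names, the blocking timeout stands in for "until a timer expires", and EINTR handling is omitted for brevity.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper, not an Open MPI function.
 * Spin with a non-blocking poll() up to `spin_limit` times; if nothing
 * becomes ready, block in poll() for up to `timeout_ms` milliseconds.
 * Returns the number of ready descriptors, 0 on timeout, -1 on error. */
static int poll_then_block(struct pollfd *fds, nfds_t nfds,
                           int spin_limit, int timeout_ms)
{
    assert(fds != NULL);
    for (int i = 0; i < spin_limit; ++i) {
        int ready = poll(fds, nfds, 0);   /* timeout 0: return immediately */
        if (ready != 0) {
            return ready;                 /* something happened, or an error */
        }
    }
    return poll(fds, nfds, timeout_ms);   /* nothing yet: block */
}
```

A caller would pass one pollfd per registered descriptor; as the message notes, blocking is only safe if every registered progress function has supplied a valid fd, since otherwise a device without one would go unpolled while the process sleeps.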
> >
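
The completion callback proposed for ompi_request_t in the list above can be sketched as follows. The real ompi_request_t layout is not shown in this thread, so example_request_t, its fields, example_request_complete(), and count_completion() are all hypothetical; the only behavior taken from the message is that the callback is invoked on completion if it is non-NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request object; not the real ompi_request_t. */
typedef struct example_request {
    int   complete;                                   /* set once finished */
    void (*on_complete)(struct example_request *req,  /* optional callback */
                        void *ctx);
    void *ctx;                                        /* passed to callback */
} example_request_t;

/* Mark the request complete and invoke the callback only if one is set. */
static void example_request_complete(example_request_t *req)
{
    assert(req != NULL);
    req->complete = 1;
    if (req->on_complete != NULL) {
        req->on_complete(req, req->ctx);
    }
}

/* Example callback: counts how many completions it has seen. */
static void count_completion(example_request_t *req, void *ctx)
{
    (void) req;
    ++*(int *) ctx;
}
```

Because the completing layer notifies the interested party directly, a framework that only needed to learn about finished requests would no longer have to register its own progress function to check for them.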
>
> Thanks for the update on the things to come! We'll definitely keep an
> eye on things as they arrive.
>
> brad
>
> > --
>
> > Jeff Squyres
> > Cisco Systems