Subject: Re: [OMPI devel] SDP support for OPEN-MPI
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-01-14 23:12:34

On Jan 13, 2008, at 8:19 AM, Lenny Verkhovsky wrote:

> > What I meant was try to open an SDP socket. If it fails because SDP
> > is not supported / available to that peer, then open a regular
> > socket. So you should still always have only 1 socket open to a
> > peer (not 2).
> Yes, but since the listener side doesn't know on which socket to
> expect a message, it will need both sockets to be opened.

Ah, you meant the listener socket -- not 2 sockets to each peer. Ok,
fair enough. Opening up one more listener socket in each process is
no big deal (IMO).
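
To make the listener side concrete, here is a minimal C sketch of a
process opening both listening sockets. It is an illustration under
assumptions, not OMPI code: AF_INET_SDP is hard-coded to the 27
mentioned in this thread, the function names are made up, and I am
assuming an SDP socket accepts an ordinary IPv4 sockaddr_in.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: 27 is the value quoted in this thread, not a value taken
 * from a system header. It may differ on other systems. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/* Hypothetical helper: create a listening stream socket of the given
 * family on any local address with a kernel-chosen port.
 * Returns the descriptor, or -1 on any failure. */
int open_listener(int family)
{
    struct sockaddr_in addr;
    int fd = socket(family, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;              /* family not supported on this node */
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;  /* assumption: SDP uses IPv4-style addresses */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;          /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: the regular listener is required, the SDP
 * listener is optional. Returns 0 if the regular listener is up. */
int open_listeners(int *tcp_fd, int *sdp_fd)
{
    assert(tcp_fd != NULL && sdp_fd != NULL);
    *tcp_fd = open_listener(AF_INET);
    *sdp_fd = open_listener(AF_INET_SDP);   /* may legitimately be -1 */
    return (*tcp_fd >= 0) ? 0 : -1;
}
```

On a node without SDP, sdp_fd simply comes back as -1 and the process
listens on the regular socket only.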

> > > If one of the machines does not support SDP, the user will get an
> > > error.
> >
> > Well, that's one way to go, but it's certainly less friendly. It
> > means that the entire MPI job has to support SDP -- including
> > mpirun. What about clusters that do not have IB on the head node?
> >
> They can use OOB over IP sockets and BTL on SDP, it should work.

Yes, I'm fine with this -- IIRC, my point was that if SDP is not
available (and the user didn't explicitly ask for it), then it should
not be an error.
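
A rough C sketch of that policy, assuming three made-up policy values
(off / prefer / require) and the AF_INET_SDP value of 27 mentioned in
this thread. None of these names exist in OMPI, and the sketch only
covers creating the local socket, not a connect() that fails because
the peer lacks SDP.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Assumption: 27 is the value quoted in this thread. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/* Hypothetical policy values, for illustration only. */
enum sdp_policy {
    SDP_OFF,        /* never try SDP */
    SDP_PREFER,     /* try SDP, silently fall back to a regular socket */
    SDP_REQUIRE     /* user explicitly asked for SDP: failure is an error */
};

/* Returns a stream socket descriptor, or -1 on error. On success,
 * *family_out holds the address family that was actually used. */
int open_stream_socket(enum sdp_policy policy, int *family_out)
{
    int fd;

    assert(family_out != NULL);
    if (policy != SDP_OFF) {
        fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
        if (fd >= 0) {
            *family_out = AF_INET_SDP;
            return fd;
        }
        if (policy == SDP_REQUIRE) {
            return -1;          /* only an error when SDP was requested */
        }
    }
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd >= 0) {
        *family_out = AF_INET;
    }
    return fd;
}
```

With SDP_PREFER, a node without SDP ends up with a regular socket and
no error is reported to the user.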

> > >> Perhaps a more general approach would be to [perhaps
> > >> additionally] provide an MCA param to allow the user to specify
> > >> the AF_* value? (AF_INET_SDP is a standardized value, right?
> > >> I.e., will it be the same on all Linux variants [and someday
> > >> Solaris]?)
> > > I didn't find any standard for it; it seems to be "randomly"
> > > selected, since originally it was 26 and was changed to 27 due to a
> > > conflict with the kernel's defines.
> >
> > This might make an even stronger case for having an MCA param for it
> > -- if the AF_INET_SDP value is so broken that it's effectively
> > random, it may be necessary to override it on some platforms
> > (especially in light of binary OMPI and OFED distributions that may
> > not match).
> >
> If we are talking about passing the AF_INET_SDP value only, then:
> 1. Passing this value as an MCA parameter will not make any changes
>    to the SDP code.
> 2. Hopefully in the future the AF_INET_SDP value can be obtained from
>    libc, and the value will be configured automatically.
> 3. If we are talking about the AF_INET value in general (IPv4, IPv6,
>    SDP), then by making it constant with an MCA parameter we are
>    limiting ourselves to one protocol only, without being able to
>    fail over or use different protocols for different needs (i.e.,
>    SDP for OOB and IPv4 for BTL).

I'm not sure what you mean. The address family values for v4 and v6
(AF_INET and AF_INET6) are compiled into OMPI as constants, using
whatever values the system header files define. They're standardized
values, right?

My understanding of what you were saying was that AF_INET_SDP is *not*
standardized, so it may actually have different values on different
systems. Hence, an MPI app could be otherwise portable but have a
wrong value for AF_INET_SDP compiled into its code.

Are you saying something else?
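
For reference, this is roughly the kind of run-time override I have in
mind, sketched in C. MCA params can be set through OMPI_MCA_<name>
environment variables, but the parameter name used here is hypothetical
(no such param exists today), as are the function names; 27 is just the
value mentioned in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Assumption: compiled-in default, taken from the value in this thread. */
#define DEFAULT_AF_INET_SDP 27

/* Hypothetical parameter name; not an existing Open MPI MCA param. */
#define SDP_AF_ENV "OMPI_MCA_example_sdp_af_value"

/* Parse a decimal address family number. Returns `fallback` when the
 * text is missing, malformed, or not a positive int. */
int parse_sdp_family(const char *text, int fallback)
{
    char *end = NULL;
    long value;

    assert(fallback > 0);
    if (text == NULL || *text == '\0') {
        return fallback;
    }
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0 || value > INT_MAX) {
        return fallback;
    }
    return (int) value;
}

/* Address family to pass to socket(): the override if one is set and
 * valid, otherwise the compiled-in default. */
int sdp_family(void)
{
    return parse_sdp_family(getenv(SDP_AF_ENV), DEFAULT_AF_INET_SDP);
}
```

The point is only that a binary built against one set of headers could
still be told the right value on a system where it differs.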

> > >> Patrick's got a good point: is there a reason not to do this?
> > >> (LD_PRELOAD and the like) Is it problematic with the remote
> > >> orteds?
> > > Yes, it's problematic with remote orteds, and it is not really as
> > > transparent as you might think.
> > > Since we can't pass environment variables to the orteds during
> > > runtime
> >
> > I think this depends on your environment. If you're not using rsh
> > (which you shouldn't be for a large cluster, which is where SDP
> > would matter most, right?), the resource manager typically copies
> > the environment out to the cluster nodes. So an LD_PRELOAD value
> > should be set for the orteds as well.
> >
> > I agree that it's problematic for rsh, but that might also be
> > solvable (with some limits; there's only so many characters that we
> > can pass on the command line -- we did investigate having a wrapper
> > to the orted at one point to accept environment variables and then
> > launch the orted, but this was so problematic / klunky that we
> > abandoned the idea).
> >
> Using LD_PRELOAD will not allow us to use SDP and IP separately, i.e.
> SDP for OOB and IP for a BTL.

Why would you want to do that? I would think that the biggest win
here would be SDP for OOB -- the heck with the BTL. The BTL was just
done for completeness (right?); if you have OpenFabrics support, you
should be using the verbs BTL.

Perhaps I don't understand exactly what you are proposing. I was
under the impression that you were going after a common case: mpirun
and the MPI jobs are running on back-end compute nodes where all of
them support SDP (although the other case, where mpirun runs on the
head node without SDP and all the MPI processes run on back-end nodes
with SDP, is also not uncommon...). Are you thinking of something
else, or are you looking for more flexibility?

Jeff Squyres
Cisco Systems