
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SDP support for OPEN-MPI
From: Lenny Verkhovsky (lennyb_at_[hidden])
Date: 2008-01-13 08:19:07


> -----Original Message-----
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On Behalf Of Jeff Squyres
> Sent: Tuesday, January 08, 2008 4:32 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] SDP support for OPEN-MPI
>
> On Jan 8, 2008, at 7:45 AM, Lenny Verkhovsky wrote:
>
> >> Hence, if HAVE_DECL_AF_INET_SDP==1 and using AF_INET_SDP fails to
> >> that
> >> peer, it might be desirable to try to fail over to using
> >> AF_INET_something_else. I'm still technically on vacation :-), so I
> >> didn't look *too* closely at your patch, but I think you're doing
> >> that
> >> (failing over if AF_INET_SDP doesn't work because of EAFNOSUPPORT),
> >> which is good.
> > This is actually not implemented yet.
> > Supporting failover requires opening AF_INET sockets in addition to
> > SDP sockets, which can cause a problem on large clusters.
>
> What I meant was try to open an SDP socket. If it fails because SDP
> is not supported / available to that peer, then open a regular
> socket. So you should still always have only 1 socket open to a peer
> (not 2).
Yes, but since the listening side doesn't know on which socket to
expect a message, it will need both sockets to be open.
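
To make the fail-over concrete, here is a minimal sketch of the
connect side, assuming AF_INET_SDP is 27 (the current OFED value);
the function name open_peer_socket() is illustrative only, not actual
Open MPI code:

    /* Sketch of the fail-over described above: try SDP first, and fall
     * back to plain TCP if the address family is unsupported.
     * AF_INET_SDP is not standardized; 27 is the current OFED value
     * (older releases used 26). */
    #include <sys/socket.h>
    #include <errno.h>

    #ifndef AF_INET_SDP
    #define AF_INET_SDP 27
    #endif

    static int open_peer_socket(void)
    {
        int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
        if (fd < 0 && (errno == EAFNOSUPPORT || errno == EPROTONOSUPPORT)) {
            /* SDP is unavailable on this node: fail over to regular TCP. */
            fd = socket(AF_INET, SOCK_STREAM, 0);
        }
        return fd;  /* SDP reuses sockaddr_in, so connect() is unchanged */
    }

This keeps a single socket per peer on the connect side; the problem
is the accept side, which would have to listen on both families.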

>
> > If one of the machines does not support SDP, the user will get an error.
>
> Well, that's one way to go, but it's certainly less friendly. It
> means that the entire MPI job has to support SDP -- including mpirun.
> What about clusters that do not have IB on the head node?
>
They can use OOB over IP sockets and the BTL over SDP; it should work.
 
> >> Perhaps a more general approach would be to [perhaps additionally]
> >> provide an MCA param to allow the user to specify the AF_* value?
> >> (AF_INET_SDP is a standardized value, right? I.e., will it be the
> >> same on all Linux variants [and someday Solaris]?)
> > I didn't find any standard for it; it seems to have been "randomly"
> > selected, since originally it was 26 and it was changed to 27 due to
> > a conflict with the kernel's defines.
>
> This might make an even stronger case for having an MCA param for it
> -- if the AF_INET_SDP value is so broken that it's effectively random,
> it may be necessary to override it on some platforms (especially in
> light of binary OMPI and OFED distributions that may not match).
>
If we are talking about passing the AF_INET_SDP value only, then:
1. Passing this value as an MCA parameter will not require any changes
to the SDP code (see the sketch after this list).
2. Hopefully, in the future the AF_INET_SDP value can be obtained from
libc, and the value will be configured automatically.
3. If we are talking about the AF_INET value in general (IPv4, IPv6,
SDP), then by making it constant with an MCA parameter we limit
ourselves to one protocol only, without being able to fail over or to
use different protocols for different needs (i.e. SDP for OOB and IPv4
for the BTL).
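
As a sketch of item 1, the registration could look roughly like the
following, assuming the mca_base_param_reg_int() interface; the
parameter name "af" is hypothetical and this is not actual Open MPI
code:

    /* Hypothetical sketch: expose the address family as an MCA
     * parameter so the non-standardized AF_INET_SDP value can be
     * overridden at runtime instead of being a compile-time constant. */
    #include <sys/socket.h>
    #include "opal/mca/base/mca_base_param.h"
    #include "btl_tcp.h"              /* for mca_btl_tcp_component */

    #ifndef AF_INET_SDP
    #define AF_INET_SDP 27            /* 26 in older OFED releases */
    #endif

    static int mca_btl_tcp_af;        /* address family for new sockets */

    static void register_af_param(void)
    {
        mca_base_param_reg_int(&mca_btl_tcp_component.super.btl_version,
                               "af",  /* would appear as btl_tcp_af */
                               "Address family value used when opening "
                               "sockets (e.g. AF_INET or the platform's "
                               "AF_INET_SDP value)",
                               false, false,
                               AF_INET,        /* plain TCP by default */
                               &mca_btl_tcp_af);
    }

A user on a platform with a mismatched value could then override it
with something like "mpirun --mca btl_tcp_af 26 ..." (again, the
parameter name is hypothetical).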

> >> Patrick's got a good point: is there a reason not to do this?
> >> (LD_PRELOAD and the like) Is it problematic with the remote
> >> orted's?
> > Yes, it's problematic with remote orted's, and it's not really as
> > transparent as you might think.
> > Since we can't pass environment variables to the orted's during
> > runtime,
>
> I think this depends on your environment. If you're not using rsh
> (which you shouldn't be for a large cluster, which is where SDP would
> matter most, right?), the resource manager typically copies the
> environment out to the cluster nodes. So an LD_PRELOAD value should
> be set for the orteds as well.
>
> I agree that it's problematic for rsh, but that might also be solvable
> (with some limits; there's only so many characters that we can pass on
> the command line -- we did investigate having a wrapper to the orted
> at one point to accept environment variables and then launch the
> orted, but this was so problematic / klunky that we abandoned the
> idea).
>
Using LD_PRELOAD will not allow us to use SDP and IP separately, i.e.
SDP for OOB and IP for a BTL.
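
To make the trade-off concrete: an LD_PRELOAD shim of the kind libsdp
provides boils down to interposing socket(), roughly as sketched below
(an illustrative shim, not libsdp's actual source). Because the
rewrite happens inside the interposed call, it applies to every
AF_INET stream socket in the process, which is exactly why OOB and the
BTL cannot be split:

    /* Illustrative LD_PRELOAD interposer in the spirit of libsdp:
     * rewrite AF_INET stream sockets to AF_INET_SDP.  Note that this
     * is all-or-nothing for the whole process. */
    #define _GNU_SOURCE
    #include <sys/socket.h>
    #include <dlfcn.h>

    #ifndef AF_INET_SDP
    #define AF_INET_SDP 27
    #endif

    int socket(int domain, int type, int protocol)
    {
        static int (*real_socket)(int, int, int);
        if (!real_socket) {
            real_socket = (int (*)(int, int, int)) dlsym(RTLD_NEXT, "socket");
        }
        if (domain == AF_INET && type == SOCK_STREAM) {
            domain = AF_INET_SDP;   /* every TCP socket becomes SDP */
        }
        return real_socket(domain, type, protocol);
    }

Built with something like "gcc -shared -fPIC -o sdp_shim.so sdp_shim.c
-ldl" and activated via LD_PRELOAD=./sdp_shim.so, it affects every
socket the process opens.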

> > we must preload the SDP library in each remote environment (i.e.
> > bashrc). This will cause all applications to use SDP instead of
> > AF_INET, which means you can't choose a specific protocol for a
> > specific application; either you use SDP or AF_INET for all.
> > SDP can also be loaded with an appropriate
> > /usr/local/ofed/etc/libsdp.conf configuration, but a regular user
> > usually has no access to it.
> > (http://www.cisco.com/univercd/cc/td/doc/product/svbu/ofed/ofed_1_1/ofed_ug/sdp.htm#wp952927)
> >
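
For reference, the libsdp.conf mechanism mentioned above selects the
protocol with per-program match rules; based on the OFED documentation
linked above, the rules look roughly like this (program patterns and
ports are illustrative):

    # Illustrative libsdp.conf: SDP for one well-known server port,
    # plain TCP for everything else.
    use sdp server * *:5001
    use sdp client * *:5001
    use tcp server * *:*
    use tcp client * *:*
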
> >> Andrew's got a good point here, too -- accelerating the TCP BTL
> >> with SDP seems kinda pointless. I'm guessing that you did it
> >> because it was just about the same work as was done in the TCP OOB
> >> (for which we have no corresponding verbs interface). Is that right?
> > Indeed. But it also seems that SDP has lower overhead than VERBS in
> > some
> > cases.
>
> Are you referring to the fact that the avail(%) column is lower for
> verbs than SDP/IPoIB? That seems like a pretty weird metric for such
> small message counts. What exactly does 77.5% of 0 bytes mean?
>
> My $0.02 is that the other columns are more compelling. :-)
>
> > Tests with Sandia's overlapping benchmark
> > http://www.cs.sandia.gov/smb/overhead.html#mozTocId316713
> >
> > VERBS results
> > msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> > 0        1000        16.892    15.309    1.583     7.029     77.5
> > 2        1000        16.852    15.332    1.520     7.144     78.7
> > 4        1000        16.932    15.312    1.620     7.128     77.3
> > 8        1000        16.985    15.319    1.666     7.182     76.8
> > 16       1000        16.886    15.297    1.589     7.219     78.0
> > 32       1000        16.988    15.311    1.677     7.251     76.9
> > 64       1000        16.944    15.299    1.645     7.457     77.9
> >
> > SDP results
> > msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> > 0        1000        134.902   128.089   6.813     54.691    87.5
> > 2        1000        135.064   128.196   6.868     55.283    87.6
> > 4        1000        135.031   128.356   6.675     55.039    87.9
> > 8        1000        130.460   125.908   4.552     52.010    91.2
> > 16       1000        135.432   128.694   6.738     55.615    87.9
> > 32       1000        135.228   128.494   6.734     55.627    87.9
> > 64       1000        135.470   128.540   6.930     56.583    87.8
> >
> > IPoIB results
> > msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> > 0        1000        252.953   247.053   5.900     119.977   95.1
> > 2        1000        253.336   247.285   6.051     121.573   95.0
> > 4        1000        254.147   247.041   7.106     122.110   94.2
> > 8        1000        254.613   248.011   6.602     121.840   94.6
> > 16       1000        255.662   247.952   7.710     124.738   93.8
> > 32       1000        255.569   248.057   7.512     127.095   94.1
> > 64       1000        255.867   248.308   7.559     132.858   94.3
>
>
> --
> Jeff Squyres
> Cisco Systems
>