
Subject: Re: [OMPI devel] SDP support for OPEN-MPI
From: Lenny Verkhovsky (lennyb_at_[hidden])
Date: 2008-01-13 08:19:07


> -----Original Message-----
> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On
> Behalf Of Jeff Squyres
> Sent: Tuesday, January 08, 2008 4:32 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] SDP support for OPEN-MPI
>
> On Jan 8, 2008, at 7:45 AM, Lenny Verkhovsky wrote:
>
> >> Hence, if HAVE_DECL_AF_INET_SDP==1 and using AF_INET_SDP fails to
> >> that peer, it might be desirable to try to fail over to using
> >> AF_INET_something_else. I'm still technically on vacation :-), so I
> >> didn't look *too* closely at your patch, but I think you're doing
> >> that (failing over if AF_INET_SDP doesn't work because of
> >> EAFNOSUPPORT), which is good.
> > This is actually not implemented yet.
> > Supporting failover requires opening AF_INET sockets in addition to
> > the SDP sockets, which can cause a problem on large clusters.
>
> What I meant was try to open an SDP socket. If it fails because SDP
> is not supported / available to that peer, then open a regular
> socket. So you should still always have only 1 socket open to a peer
> (not 2).
Yes, but since the listener side doesn't know on which socket to expect
a message, it will need both sockets to be open.
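
Something like the following minimal sketch is what is being discussed
here. AF_INET_SDP is not in the standard headers, so both the guard and
the value 27 below are assumptions (see the discussion of the constant
further down):

#include <errno.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* assumed value; not standardized */
#endif

/* Try SDP first; fall back to plain TCP if the kernel has no SDP
 * support.  The connecting side ends up with a single socket either
 * way -- but, per the point above, the listening side cannot know
 * which family a peer will use, so it would still have to accept on
 * both an SDP and an AF_INET socket. */
static int open_stream_socket(void)
{
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0 && (errno == EAFNOSUPPORT || errno == EPROTONOSUPPORT)) {
        fd = socket(AF_INET, SOCK_STREAM, 0);
    }
    return fd;
}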

>
> > If one of the machines does not support SDP, the user will get an error.
>
> Well, that's one way to go, but it's certainly less friendly. It
> means that the entire MPI job has to support SDP -- including mpirun.
> What about clusters that do not have IB on the head node?
>
They can use OOB over IP sockets and the BTL over SDP; it should work.
 
> >> Perhaps a more general approach would be to [perhaps additionally]
> >> provide an MCA param to allow the user to specify the AF_* value?
> >> (AF_INET_SDP is a standardized value, right? I.e., will it be the
> >> same on all Linux variants [and someday Solaris]?)
> > I didn't find any standard on it; it seems to be "randomly" selected,
> > since it was originally 26 and was changed to 27 due to a conflict
> > with the kernel's defines.
>
> This might make an even stronger case for having an MCA param for it
> -- if the AF_INET_SDP value is so broken that it's effectively random,
> it may be necessary to override it on some platforms (especially in
> light of binary OMPI and OFED distributions that may not match).
>
If we are talking about passing the AF_INET_SDP value only, then:
1. Passing this value as an MCA parameter will not require any changes
to the SDP code (see the sketch after this list).
2. Hopefully, in the future, the AF_INET_SDP value can be obtained from
libc, and the value will be configured automatically.
3. If we are talking about the AF_* value in general (IPv4, IPv6, SDP),
then by fixing it with an MCA parameter we are limiting ourselves to
one protocol only, without being able to fail over or to use different
protocols for different needs (i.e. SDP for the OOB and IPv4 for the BTL).
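
As a sketch of point 1: the parameter name below is hypothetical, and a
real implementation would register it through the MCA parameter API
rather than calling getenv() directly (MCA parameters are settable via
OMPI_MCA_* environment variables, which is what this shortcut leans on):

#include <stdlib.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* current OFED choice; was 26 originally */
#endif

/* "oob_tcp_sdp_af" is a hypothetical MCA parameter name, used here
 * only to illustrate overriding the address-family value at runtime. */
static int sdp_address_family(void)
{
    const char *val = getenv("OMPI_MCA_oob_tcp_sdp_af");
    return (val != NULL) ? atoi(val) : AF_INET_SDP;
}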

> >> Patrick's got a good point: is there a reason not to do this?
> >> (LD_PRELOAD and the like) Is it problematic with the remote
> >> orteds?
> > Yes, it's problematic with remote orteds, and it's not really as
> > transparent as you might think.
> > Since we can't pass environment variables to the orteds during
> > runtime,
>
> I think this depends on your environment. If you're not using rsh
> (which you shouldn't be for a large cluster, which is where SDP would
> matter most, right?), the resource manager typically copies the
> environment out to the cluster nodes. So an LD_PRELOAD value should
> be set for the orteds as well.
>
> I agree that it's problematic for rsh, but that might also be solvable
> (with some limits; there's only so many characters that we can pass on
> the command line -- we did investigate having a wrapper to the orted
> at one point to accept environment variables and then launch the
> orted, but this was so problematic / klunky that we abandoned the
> idea).
>
Using LD_PRELOAD will not allow us to use SDP and IP separately, i.e.
SDP for OOB and IP for a BTL.

> > we must preload the SDP library into each remote environment (i.e.
> > bashrc). This will cause all applications to use SDP instead of
> > AF_INET, which means you can't choose a specific protocol for a
> > specific application; either you use SDP or AF_INET for all of them.
> > SDP can also be configured through the appropriate
> > /usr/local/ofed/etc/libsdp.conf file, but a plain user usually has
> > no access to it.
> > (http://www.cisco.com/univercd/cc/td/doc/product/svbu/ofed/ofed_1_1/ofed_ug/sdp.htm#wp952927)
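
For completeness: libsdp.conf does match on a per-program basis, which
is exactly the per-application selection that LD_PRELOAD alone cannot
give you. From memory of the OFED 1.1 documentation, entries look
roughly like this (the exact syntax and the program pattern are
assumptions -- verify against the libsdp.conf shipped with your OFED
installation):

use sdp server *       *:*       # accept everything over SDP
use tcp client myapp*  *:5001    # hypothetical program pattern over TCP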
> >
> >> Andrew's got a point here, too -- accelerating the TCP BTL with
> >> SDP seems kinda pointless. I'm guessing that you did it because it
> >> was just about the same work as was done in the TCP OOB (for which
> >> we have no corresponding verbs interface). Is that right?
> > Indeed. But it also seems that SDP has lower overhead than VERBS in
> > some cases.
>
> Are you referring to the fact that the avail(%) column is lower for
> verbs than SDP/IPoIB? That seems like a pretty weird metric for such
> small message counts. What exactly does 77.5% of 0 bytes mean?

>
> My $0.02 is that the other columns are more compelling. :-)
>
> > Tests with Sandia's overlapping benchmark
> > http://www.cs.sandia.gov/smb/overhead.html#mozTocId316713
> >
> > VERBS results
> > msgsize  iterations    iter_t    work_t  overhead    base_t  avail(%)
> >       0        1000    16.892    15.309     1.583     7.029      77.5
> >       2        1000    16.852    15.332     1.520     7.144      78.7
> >       4        1000    16.932    15.312     1.620     7.128      77.3
> >       8        1000    16.985    15.319     1.666     7.182      76.8
> >      16        1000    16.886    15.297     1.589     7.219      78.0
> >      32        1000    16.988    15.311     1.677     7.251      76.9
> >      64        1000    16.944    15.299     1.645     7.457      77.9
> >
> > SDP results
> > msgsize  iterations    iter_t    work_t  overhead    base_t  avail(%)
> >       0        1000   134.902   128.089     6.813    54.691      87.5
> >       2        1000   135.064   128.196     6.868    55.283      87.6
> >       4        1000   135.031   128.356     6.675    55.039      87.9
> >       8        1000   130.460   125.908     4.552    52.010      91.2
> >      16        1000   135.432   128.694     6.738    55.615      87.9
> >      32        1000   135.228   128.494     6.734    55.627      87.9
> >      64        1000   135.470   128.540     6.930    56.583      87.8
> >
> > IPoIB results
> > msgsize  iterations    iter_t    work_t  overhead    base_t  avail(%)
> >       0        1000   252.953   247.053     5.900   119.977      95.1
> >       2        1000   253.336   247.285     6.051   121.573      95.0
> >       4        1000   254.147   247.041     7.106   122.110      94.2
> >       8        1000   254.613   248.011     6.602   121.840      94.6
> >      16        1000   255.662   247.952     7.710   124.738      93.8
> >      32        1000   255.569   248.057     7.512   127.095      94.1
> >      64        1000   255.867   248.308     7.559   132.858      94.3
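
Regarding "what does 77.5% of 0 bytes mean": in every row above the
columns satisfy overhead = iter_t - work_t and avail(%) =
(1 - overhead / base_t) * 100, so avail is the share of the base
(no-overlap) time not consumed by MPI overhead; the message size only
enters through the timings themselves. A quick check against the first
VERBS row (the formula is inferred from the data above, not quoted from
the benchmark's documentation):

#include <stdio.h>

int main(void)
{
    /* First VERBS row: msgsize 0 */
    double iter_t = 16.892, work_t = 15.309, base_t = 7.029;
    double overhead = iter_t - work_t;                 /* 1.583 */
    double avail = (1.0 - overhead / base_t) * 100.0;  /* ~77.5 */
    printf("overhead = %.3f, avail = %.1f%%\n", overhead, avail);
    return 0;
}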
>
>
> --
> Jeff Squyres
> Cisco Systems
>