
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SDP support for OPEN-MPI
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-01-08 09:32:10


On Jan 8, 2008, at 7:45 AM, Lenny Verkhovsky wrote:

>> Hence, if HAVE_DECL_AF_INET_SDP==1 and using AF_INET_SDP fails to
>> that peer, it might be desirable to try to fail over to using
>> AF_INET_something_else. I'm still technically on vacation :-), so I
>> didn't look *too* closely at your patch, but I think you're doing
>> that (failing over if AF_INET_SDP doesn't work because of
>> EAFNOSUPPORT), which is good.
> This is actually not implemented yet.
> Supporting failover would require opening AF_INET sockets in addition
> to the SDP sockets, which can cause a problem on large clusters.

What I meant was try to open an SDP socket. If it fails because SDP
is not supported / available to that peer, then open a regular
socket. So you should still always have only 1 socket open to a peer
(not 2).

> If one of the machines does not support SDP, the user will get an
> error.

Well, that's one way to go, but it's certainly less friendly. It
means that the entire MPI job has to support SDP -- including mpirun.
What about clusters that do not have IB on the head node?

>> Perhaps a more general approach would be to [perhaps additionally]
>> provide an MCA param to allow the user to specify the AF_* value?
>> (AF_INET_SDP is a standardized value, right? I.e., will it be the
>> same on all Linux variants [and someday Solaris]?)
> I didn't find any standard for it; it seems to have been "randomly"
> selected: originally it was 26, and it was changed to 27 due to a
> conflict with the kernel's defines.

This might make an even stronger case for having an MCA param for it
-- if the AF_INET_SDP value is so broken that it's effectively random,
it may be necessary to override it on some platforms (especially in
light of binary OMPI and OFED distributions that may not match).

>> Patrick's got a good point: is there a reason not to do this?
>> (LD_PRELOAD and the like) Is it problematic with the remote orted's?
> Yes, it's problematic with remote orteds, and it's not really as
> transparent as you might think.
> Since we can't pass environment variables to the orteds during
> runtime,

I think this depends on your environment. If you're not using rsh
(which you shouldn't be for a large cluster, which is where SDP would
matter most, right?), the resource manager typically copies the
environment out to the cluster nodes. So an LD_PRELOAD value should
be set for the orteds as well.

I agree that it's problematic for rsh, but that might also be solvable
(with some limits; there's only so many characters that we can pass on
the command line -- we did investigate having a wrapper to the orted
at one point to accept environment variables and then launch the
orted, but this was so problematic / klunky that we abandoned the idea).

> we must preload the SDP library into each remote environment (i.e.,
> bashrc). This will cause all applications to use SDP instead of
> AF_INET, which means you can't choose a specific protocol for a
> specific application; you either use SDP or AF_INET for everything.
> SDP can also be enabled with an appropriate /usr/local/ofed/etc/
> libsdp.conf configuration, but a regular user usually has no access
> to it.
> (http://www.cisco.com/univercd/cc/td/doc/product/svbu/ofed/ofed_1_1/ofed_ug/sdp.htm#wp952927)
>
>> Andrew's got a good point here, too -- accelerating the TCP BTL with
>> SDP seems kinda pointless. I'm guessing that you did it because it
>> was just about the same work as was done in the TCP OOB (for which we
>> have no corresponding verbs interface). Is that right?
> Indeed. But it also seems that SDP has lower overhead than VERBS in
> some cases.

Are you referring to the fact that the avail(%) column is lower for
verbs than SDP/IPoIB? That seems like a pretty weird metric for such
small message counts. What exactly does 77.5% of 0 bytes mean?

My $0.02 is that the other columns are more compelling. :-)

> Tests with Sandia's overlapping benchmark
> http://www.cs.sandia.gov/smb/overhead.html#mozTocId316713
>
> VERBS results
> msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> 0        1000        16.892    15.309    1.583     7.029     77.5
> 2        1000        16.852    15.332    1.520     7.144     78.7
> 4        1000        16.932    15.312    1.620     7.128     77.3
> 8        1000        16.985    15.319    1.666     7.182     76.8
> 16       1000        16.886    15.297    1.589     7.219     78.0
> 32       1000        16.988    15.311    1.677     7.251     76.9
> 64       1000        16.944    15.299    1.645     7.457     77.9
>
> SDP results
> msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> 0        1000        134.902   128.089   6.813     54.691    87.5
> 2        1000        135.064   128.196   6.868     55.283    87.6
> 4        1000        135.031   128.356   6.675     55.039    87.9
> 8        1000        130.460   125.908   4.552     52.010    91.2
> 16       1000        135.432   128.694   6.738     55.615    87.9
> 32       1000        135.228   128.494   6.734     55.627    87.9
> 64       1000        135.470   128.540   6.930     56.583    87.8
>
> IPoIB results
> msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> 0        1000        252.953   247.053   5.900     119.977   95.1
> 2        1000        253.336   247.285   6.051     121.573   95.0
> 4        1000        254.147   247.041   7.106     122.110   94.2
> 8        1000        254.613   248.011   6.602     121.840   94.6
> 16       1000        255.662   247.952   7.710     124.738   93.8
> 32       1000        255.569   248.057   7.512     127.095   94.1
> 64       1000        255.867   248.308   7.559     132.858   94.3

-- 
Jeff Squyres
Cisco Systems