
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
From: Joshua Ladd (jladd.mlnx_at_[hidden])
Date: 2014-05-08 09:10:00


Hi Adam,

We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
push upstream. However, using it will require linking in a proprietary
Mellanox library that accelerates the collective operations (available in
MOFED versions 2.1 and higher), similar in spirit to the MXM MTL or FCA
COLL components in OMPI.

Best,

Josh

On Wed, May 7, 2014 at 11:45 AM, Moody, Adam T. <moody20_at_[hidden]> wrote:

> Hi Josh,
> Are your changes to OMPI or SLURM's PMI2 implementation? Do you plan to
> push those changes back upstream?
> -Adam
>
>
> ------------------------------
> *From:* devel [devel-bounces_at_[hidden]] on behalf of Joshua Ladd [
> jladd.mlnx_at_[hidden]]
> *Sent:* Wednesday, May 07, 2014 7:56 AM
> *To:* Open MPI Developers
>
> *Subject:* Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is
> specifically requested
>
> Ah, I see. Sorry for the reactionary comment - but this feature falls
> squarely within my "jurisdiction", and we've invested a lot in improving
> OMPI jobstart under srun.
>
> That being said (now that I've taken some deep breaths and carefully read
> your original email :)), what you're proposing isn't a bad idea. I think it
> would be worth adding a "--with-pmi2" flag to configure, since
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
> hack the installation.
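>
> Roughly, the point of the extra flag is to move the PMI-2 choice from
> "configure found pmi2.h, so build against it" to an explicit user
> decision. In terms of the PMI glue code it boils down to something like
> this (the WANT_PMI2 macro name is made up here, just to show the idea):
>
>     /* set by the proposed --with-pmi2, not by header autodetection */
>     #ifdef WANT_PMI2
>     #include <slurm/pmi2.h>
>     static const int pmi_version = 2;
>     #else
>     #include <slurm/pmi.h>
>     static const int pmi_version = 1;
>     #endif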
>
> Josh
>
>
> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Okay, then we'll just have to develop a workaround for all those Slurm
>> releases where PMI-2 is borked :-(
>>
>> FWIW: I think people misunderstood my statement. I specifically did
>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>> stabilized, then we could reverse that policy.
>>
>> However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>> broken and then fall back to PMI-1.
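>>
>> The check could be as crude as probing the two interfaces at startup.
>> Just a sketch against Slurm's pmi2.h/pmi.h (not what we'd actually
>> commit; whether a failed PMI2_Init can safely be followed by PMI_Init
>> in the same process depends on the Slurm version):
>>
>>     #include <slurm/pmi2.h>   /* PMI2_Init, PMI2_SUCCESS */
>>     #include <slurm/pmi.h>    /* PMI_Init,  PMI_SUCCESS  */
>>
>>     /* Returns 2 if PMI-2 came up, 1 if we fell back to PMI-1, 0 if
>>        neither works; on success the interface is left initialized. */
>>     static int select_pmi_version(void)
>>     {
>>         int spawned, size, rank, appnum;
>>         if (PMI2_Init(&spawned, &size, &rank, &appnum) == PMI2_SUCCESS)
>>             return 2;
>>         if (PMI_Init(&spawned) == PMI_SUCCESS)
>>             return 1;
>>         return 0;
>>     }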
>>
>>
>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>
>> Just saw this thread, and I second Chris' observations: at scale we
>> are seeing huge gains in jobstart performance with PMI2 over PMI1. We
>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>> provide exact numbers, but let's say the difference is in the ballpark of a
>> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
>> but there is no contest between PMI1 and PMI2. We (MLNX) are actively
>> working to resolve some of the scalability issues in PMI2.
>>
>> Josh
>>
>> Joshua S. Ladd
>> Staff Engineer, HPC Software
>> Mellanox Technologies
>>
>> Email: joshual_at_[hidden]
>>
>>
>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Interesting - how many nodes were involved? As I said, the bad scaling
>>> becomes more evident at a fairly high node count.
>>>
>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <samuel_at_[hidden]>
>>> wrote:
>>>
>>> >
>>> > Hiya Ralph,
>>> >
>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>> >
>>> >> I should have looked closer to see the numbers you posted, Chris -
>>> >> those include time for MPI wireup. So what you are seeing is that
>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>> >> than PMI. I suspect that PMI2 is not much better as the primary
>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>> >> requires that everything be encoded into strings and sent in little
>>> >> pieces.
>>> >>
>>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>>> >> operation) much faster, and MPI_Init completes faster. The rest of the
>>> >> computation should be the same, so long-running compute apps will see
>>> >> the difference narrow considerably.
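>>> >>
>>> >> To give a concrete sense of what "encoded into strings" means, the
>>> >> PMI-1 KVS path looks something like this (just a sketch, not our
>>> >> actual code): a binary blob has to be turned into a printable
>>> >> string (here hex, which doubles its size), and large blobs also
>>> >> have to be split into several keys to respect
>>> >> PMI_KVS_Get_value_length_max() -- the "little pieces".
>>> >>
>>> >>     #include <stdio.h>
>>> >>     #include <slurm/pmi.h>  /* PMI_KVS_Put, PMI_KVS_Commit */
>>> >>
>>> >>     /* Publish one endpoint blob (<= 256 bytes here) as a single
>>> >>        string-valued key; real code must also chunk larger blobs. */
>>> >>     static int publish_blob(const char *kvs, const char *key,
>>> >>                             const unsigned char *blob, int len)
>>> >>     {
>>> >>         char value[2 * 256 + 1];
>>> >>         for (int i = 0; i < len; i++)
>>> >>             sprintf(value + 2 * i, "%02x", blob[i]);
>>> >>         if (PMI_KVS_Put(kvs, key, value) != PMI_SUCCESS)
>>> >>             return -1;
>>> >>         return (PMI_KVS_Commit(kvs) == PMI_SUCCESS) ? 0 : -1;
>>> >>     }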
>>> >
>>> > Unfortunately it looks like I had an enthusiastic cleanup at some point
>>> > and so I cannot find the out files from those runs at the moment, but
>>> > I did find some comparisons from around that time.
>>> >
>>> > This first pair compares NAMD with OMPI 1.7.3a1r29103, run first with
>>> > mpirun and then with srun from inside the same Slurm job.
>>> >
>>> > mpirun namd2 macpf.conf
>>> > srun --mpi=pmi2 namd2 macpf.conf
>>> >
>>> > Firstly the mpirun output (grep'ing the interesting bits):
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB memory
>>> > WallClock: 1403.388550 CPUTime: 1403.388550 Memory: 1119.085938 MB
>>> >
>>> > Now the srun output:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB memory
>>> > WallClock: 1230.784424 CPUTime: 1230.784424 Memory: 1100.648438 MB
>>> >
>>> >
>>> > The next two pairs are first launched using mpirun from 1.6.x and then
>>> > with srun from 1.7.3a1r29103. Again each pair inside the same Slurm job
>>> > with the same inputs.
>>> >
>>> > First pair mpirun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
>>> > WallClock: 8341.524414 CPUTime: 8341.524414 Memory: 975.015625 MB
>>> >
>>> > First pair srun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
>>> > WallClock: 7476.643555 CPUTime: 7476.643555 Memory: 968.867188 MB
>>> >
>>> >
>>> > Second pair mpirun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
>>> > WallClock: 7842.831543 CPUTime: 7842.831543 Memory: 1004.050781 MB
>>> >
>>> > Second pair srun:
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
>>> > Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
>>> > WallClock: 7522.677246 CPUTime: 7522.677246 Memory: 969.433594 MB
>>> >
>>> >
>>> > So to me it looks like (for NAMD on our system at least) PMI2 does
>>> > give better scalability.
>>> >
>>> > All the best!
>>> > Chris
>>> > --
>>> > Christopher Samuel - Senior Systems Administrator
>>> > VLSCI - Victorian Life Sciences Computation Initiative
>>> > Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
>>> > http://www.vlsci.org.au/ http://twitter.com/vlsci
>>> >