
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
From: Artem Polyakov (artpol84_at_[hidden])
Date: 2014-05-07 12:10:49


That's a good point. There are actually a bunch of modules in ompi, opal, and
orte that would have to be duplicated.

On Wednesday, May 7, 2014, Joshua Ladd wrote:

> +1 Sounds like a good idea - but decoupling the two and adding all the
> right selection mojo might be a bit of a pain. There are several places in
> OMPI where the distinction between PMI1 and PMI2 is made, not only in
> grpcomm. The DB and ESS frameworks, off the top of my head.
>
> Josh
>
>
> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov <artpol84_at_[hidden]>wrote:
>
>> Good idea :)!
>>
>> On Wednesday, May 7, 2014, Ralph Castain wrote:
>>
>> Jeff actually had a useful suggestion (gasp!). He proposed that we
>> separate the PMI-1 and PMI-2 codes into separate components so you could
>> select them at runtime. Thus, we would build both (assuming both PMI-1 and
>> 2 libs are found), default to PMI-1, but users could select to try PMI-2.
>> If the PMI-2 component failed, we would emit a show_help indicating that
>> they probably have a broken PMI-2 version and should try PMI-1.
>>
>> Make sense?
>> Ralph
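
To make the idea concrete, here is a minimal standalone sketch (plain C, and
emphatically not actual OMPI code) of the selection policy described above:
both components are built, PMI-1 is the default, the user can opt in to
PMI-2, and a help message is emitted if the requested component fails. The
component table, the OMPI_PMI_COMPONENT environment variable, and the init
stubs are all illustrative stand-ins for OMPI's MCA machinery and
orte_show_help().

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        const char *name;
        int (*init)(void);             /* returns 0 on success */
    } pmi_component_t;

    /* Stubs standing in for the real PMI-1/PMI-2 init paths. */
    static int pmi1_init(void) { return 0; }
    static int pmi2_init(void) { return -1; }  /* pretend PMI-2 is broken */

    static pmi_component_t components[] = {
        { "pmi1", pmi1_init },         /* default */
        { "pmi2", pmi2_init },         /* on request only */
    };

    int main(void)
    {
        /* Hypothetical knob standing in for an MCA parameter. */
        const char *requested = getenv("OMPI_PMI_COMPONENT");
        const char *name = requested ? requested : "pmi1";

        for (size_t i = 0; i < sizeof components / sizeof *components; i++) {
            if (strcmp(components[i].name, name) != 0)
                continue;
            if (components[i].init() == 0) {
                printf("selected %s\n", name);
                return 0;
            }
            /* Stand-in for the proposed show_help message. */
            fprintf(stderr, "The %s component failed to initialize; your "
                    "PMI-2 install may be broken. Try the default pmi1 "
                    "component.\n", name);
            return 1;
        }
        fprintf(stderr, "unknown PMI component '%s'\n", name);
        return 1;
    }
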
>>
>> On May 7, 2014, at 8:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>
>> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>
>> Ah, I see. Sorry for the reactionary comment - but this feature falls
>> squarely within my "jurisdiction", and we've invested a lot in improving
>> OMPI jobstart under srun.
>>
>> That being said (now that I've taken some deep breaths and carefully read
>> your original email :)), what you're proposing isn't a bad idea. I think it
>> would be good to maybe add a "--with-pmi2" flag to configure since
>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
>> hack the installation.
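
For illustration, the two configure invocations being contrasted would look
something like this ("/opt/slurm" is an assumed install prefix, and the
"--with-pmi2" spelling is only the proposal from this thread, not an existing
flag):

    ./configure --with-pmi=/opt/slurm               # today: PMI2 chosen automatically if found
    ./configure --with-pmi=/opt/slurm --with-pmi2   # proposed: opt in to PMI2 explicitly
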
>>
>>
>> That would be a much simpler solution than what Artem proposed (off-list)
>> where we would try PMI2 and then if it didn't work try to figure out how to
>> fall back to PMI1. I'll add this for now, and if Artem wants to try his
>> more automagic solution and can make it work, then we can reconsider that
>> option.
>>
>> Thanks
>> Ralph
>>
>>
>> Josh
>>
>>
>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> Okay, then we'll just have to develop a workaround for all those Slurm
>> releases where PMI-2 is borked :-(
>>
>> FWIW: I think people misunderstood my statement. I specifically did *not*
>> propose to *lose* PMI-2 support. I suggested that we change it to
>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>> stabilized, then we could reverse that policy.
>>
>> However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>> broken and then fall back to PMI-1.
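
As a hedged sketch of what that runtime fallback could look like, written
against the standard pmi2.h and pmi.h headers (link with -lpmi2 -lpmi; note
this only catches init-time failures, and a PMI-2 library that initializes
but misbehaves later would need deeper probing):

    #include <stdio.h>
    #include <pmi2.h>
    #include <pmi.h>

    /* Try PMI-2 first; if its init fails, warn and fall back to PMI-1.
     * Returns 0 on success, -1 if neither interface comes up. */
    int init_pmi(int *rank, int *size, int *using_pmi2)
    {
        int spawned, appnum;

        *using_pmi2 = 0;
        if (PMI2_Init(&spawned, size, rank, &appnum) == PMI2_SUCCESS) {
            *using_pmi2 = 1;
            return 0;
        }
        fprintf(stderr, "PMI2_Init failed (broken Slurm pmi2 plugin?); "
                "falling back to PMI-1\n");
        if (PMI_Init(&spawned) != PMI_SUCCESS)
            return -1;
        PMI_Get_rank(rank);
        PMI_Get_size(size);
        return 0;
    }
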
>>
>>
>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>
>> Just saw this thread, and I second Chris' observations: at scale we are
>> seeing huge gains in jobstart performance with PMI2 over PMI1. We
>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>> provide exact numbers, but let's say the difference is in the ballpark of a
>> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
>> but there is no contest between PMI1 and PMI2. We (MLNX) are actively
>> working to resolve some of the scalability issues in PMI2.
>>
>> Josh
>>
>> Joshua S. Ladd
>> Staff Engineer, HPC Software
>> Mellanox Technologies
>>
>> Email: joshual_at_[hidden]
>>
>>
>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> Interesting - how many nodes were involved? As I said, the bad scaling
>> becomes more evident at a fairly high node count.
>>
>> On May 7, 2014, at 12:07 AM, Christopher Samuel <samuel_at_[hidden]>
>> wrote:
>>
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA1
>> >
>> > Hiya Ralph,
>> >
>> > On 07/05/14 14:49, Ralph Castain wrote:
>> >
>> >> I should have looked closer to see the numbers you posted, Chris -
>> >> those include time for MPI wireup. So what you are seeing is that
>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>> >> than PMI. I suspect that PMI2 is not much better as the primary
>> >> reason for the difference is that mpirun sends blobs, while PMI
>> >> requires that everything b
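
The blobs-versus-keys point is the crux of the wireup cost. As a minimal
sketch against the standard pmi.h API (the "ep-%d" key name and its payload
are made up for the example), a PMI-1 endpoint exchange has every rank
publish one key and then fetch N-1 keys individually, O(N^2) lookups in
total across the job:

    #include <stdio.h>
    #include <pmi.h>

    int main(void)
    {
        int spawned, rank, size;
        char kvsname[256], key[64], val[256];

        PMI_Init(&spawned);
        PMI_Get_rank(&rank);
        PMI_Get_size(&size);
        PMI_KVS_Get_my_name(kvsname, (int)sizeof kvsname);

        /* Publish this rank's (illustrative) endpoint info. */
        snprintf(key, sizeof key, "ep-%d", rank);
        snprintf(val, sizeof val, "endpoint-data-for-rank-%d", rank);
        PMI_KVS_Put(kvsname, key, val);
        PMI_KVS_Commit(kvsname);
        PMI_Barrier();

        /* Every rank now pulls each other rank's key one at a time. */
        for (int r = 0; r < size; r++) {
            if (r == rank)
                continue;
            snprintf(key, sizeof key, "ep-%d", r);
            PMI_KVS_Get(kvsname, key, val, (int)sizeof val);
        }

        PMI_Finalize();
        return 0;
    }
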
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/05/14716.php
>>