Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
From: Artem Polyakov (artpol84_at_[hidden])
Date: 2014-05-07 19:44:22


2014-05-08 5:54 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:

> Ummm....no, I don't think that's right. I believe we decided to instead
> create the separate components, default to PMI-2 if available, print nice
> error message if not, otherwise use PMI-1.
>
> I don't want to initialize both PMIs in parallel as most installations
> won't support it.
>

Ok, I agree. Besides the lack of support, there can be a performance hit
caused by PMI1 initialization at scale. This is not the case for SLURM's PMI1,
since it is quite simple and local, but I didn't consider other
implementations.
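
To make the agreed order concrete, here is a minimal sketch in Python rather
than Open MPI code. PmiComponent, select_pmi and the message strings are
hypothetical names used only for illustration, and reading "print nice error
message if not" as "PMI-2 was built but failed to initialize" is an
assumption about what Ralph meant.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class PmiComponent:
    """Hypothetical stand-in for one PMI component (not an Open MPI type)."""
    name: str
    init: Callable[[], bool]  # returns True when initialization succeeds


def select_pmi(built: Dict[str, PmiComponent]) -> Tuple[Optional[str], List[str]]:
    """Return (name of the selected component or None, help messages emitted)."""
    messages: List[str] = []

    pmi2 = built.get("pmi2")
    if pmi2 is not None:
        # PMI-2 is the default whenever its component was built.
        if pmi2.init():
            return "pmi2", messages
        # Assumption: report the failure instead of also starting PMI-1,
        # so that both PMIs are never initialized in the same run.
        messages.append(
            "PMI-2 failed to initialize; this PMI-2 installation may be "
            "broken, try PMI-1 instead."
        )
        return None, messages

    pmi1 = built.get("pmi1")
    if pmi1 is not None and pmi1.init():
        # PMI-2 was not built at all, so PMI-1 is used.
        return "pmi1", messages

    messages.append("No usable PMI component was found.")
    return None, messages
```

With both components built and a working PMI-2, select_pmi returns "pmi2";
with only PMI-1 built it returns "pmi1" and no messages.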

> On May 7, 2014, at 3:49 PM, Artem Polyakov <artpol84_at_[hidden]> wrote:
>
> Ralph and I discussed Joshua's concerns and decided to try automatic PMI2
> correctness detection first, as was initially intended. Here is my idea. The
> universal way to decide whether PMI2 is correct is to compare the results of
> PMI_Init(.., &rank, &size, ...) and PMI2_Init(.., &rank, &size, ...). Size
> and rank should be equal. In that case we finalize PMI1 and proceed with
> PMI2. Otherwise we finalize PMI2 and proceed with PMI1.
> I need to clarify with the SLURM guys whether parallel initialization of both
> PMIs is legal. If not, we'll do it sequentially.
> In other places we'll just use a flag saying which PMI version to use.
> Does that sound reasonable?
>
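
The rank/size comparison proposed above could look roughly like this. It is a
sketch in Python, not Open MPI code: choose_pmi_version and its callback
arguments are hypothetical, each init callback is assumed to return a
(rank, size) pair or None on failure, and the real PMI calls are not modeled.

```python
from typing import Callable, Optional, Tuple

RankSize = Tuple[int, int]  # (rank, size) as reported by one PMI version


def choose_pmi_version(
    pmi1_init: Callable[[], Optional[RankSize]],
    pmi2_init: Callable[[], Optional[RankSize]],
    pmi1_finalize: Callable[[], None],
    pmi2_finalize: Callable[[], None],
) -> int:
    """Return 2 when PMI2 agrees with PMI1 on (rank, size), otherwise 1."""
    from_pmi1 = pmi1_init()
    if from_pmi1 is None:
        # Assumption: PMI1 is the reference, so its failure is fatal here.
        raise RuntimeError("PMI1 failed to initialize")

    from_pmi2 = pmi2_init()
    if from_pmi2 is not None and from_pmi2 == from_pmi1:
        # Both versions agree: keep PMI2 and shut PMI1 down.
        pmi1_finalize()
        return 2

    if from_pmi2 is not None:
        # PMI2 initialized but reported a different rank or size.
        pmi2_finalize()
    return 1
```

The returned value plays the role of the flag mentioned above that tells the
other places which PMI version to use. Both PMIs are initialized here before
the comparison, which is exactly the step that still needs to be confirmed as
legal.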
> 2014-05-07 23:10 GMT+07:00 Artem Polyakov <artpol84_at_[hidden]>:
>
>> That's a good point. There are actually a bunch of modules in ompi, opal
>> and orte that have to be duplicated.
>>
>> On Wednesday, May 7, 2014, Joshua Ladd wrote:
>>
>>> +1 Sounds like a good idea - but decoupling the two and adding all the
>>> right selection mojo might be a bit of a pain. There are several places in
>>> OMPI where the distinction between PMI1 and PMI2 is made, not only in
>>> grpcomm. DB and ESS frameworks off the top of my head.
>>>
>>> Josh
>>>
>>>
>>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov <artpol84_at_[hidden]>
>>> wrote:
>>>
>>>> Good idea :)!
>>>>
>>>> On Wednesday, May 7, 2014, Ralph Castain wrote:
>>>>
>>>> Jeff actually had a useful suggestion (gasp!). He proposed that we
>>>> separate the PMI-1 and PMI-2 codes into separate components so you could
>>>> select them at runtime. Thus, we would build both (assuming both PMI-1 and
>>>> 2 libs are found), default to PMI-1, but users could select to try PMI-2.
>>>> If the PMI-2 component failed, we would emit a show_help indicating that
>>>> they probably have a broken PMI-2 version and should try PMI-1.
>>>>
>>>> Make sense?
>>>> Ralph
>>>>
>>>> On May 7, 2014, at 8:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>
>>>> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>>>
>>>> Ah, I see. Sorry for the reactionary comment - but this feature falls
>>>> squarely within my "jurisdiction", and we've invested a lot in improving
>>>> OMPI jobstart under srun.
>>>>
>>>> That being said (now that I've taken some deep breaths and carefully
>>>> read your original email :)), what you're proposing isn't a bad idea. I
>>>> think it would be good to maybe add a "--with-pmi2" flag to configure since
>>>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
>>>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
>>>> hack the installation.
>>>>
>>>>
>>>> That would be a much simpler solution than what Artem proposed
>>>> (off-list) where we would try PMI2 and then if it didn't work try to figure
>>>> out how to fall back to PMI1. I'll add this for now, and if Artem wants to
>>>> try his more automagic solution and can make it work, then we can
>>>> reconsider that option.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <rhc_at_[hidden]>
>>>> wrote:
>>>>
>>>> Okay, then we'll just have to develop a workaround for all those Slurm
>>>> releases where PMI-2 is borked :-(
>>>>
>>>> FWIW: I think people misunderstood my statement. I specifically did
>>>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>>>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>>>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>>>> stabilized, then we could reverse that policy.
>>>>
>>>> However, given that both you and Chris appear to prefer to keep it
>>>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>>>> broken and then fall back to PMI-1.
>>>>
>>>>
>>>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>>>
>>>> Just saw this thread, and I second Chris' observations: at scale we are
>>>> seeing huge gains in jobstart performance with PMI2 over PMI1. We
>>>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>>>> provide exact numbers, but let's say the difference is in the ballpark of a
>>>> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
>>>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
>>>> but there is no contest between PMI1 and PMI2. We (MLNX) are actively
>>>> working to resolve some of the scalability issues in PMI2.
>>>>
>>>> Josh
>>>>
>>>> Joshua S. Ladd
>>>> Staff Engineer, HPC Software
>>>> Mellanox Technologies
>>>>
>>>> Email: joshual_at_[hidden]
>>>>
>>>>
>>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>> Interesting - how many nodes were involved? As I said, the bad scaling
>>>> becomes more evident at a fairly high node count.
>>>>
>>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <samuel_at_[hidden]>
>>>> wrote:
>>>>
>>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>>> > Hash: SHA1
>>>> >
>>>> > Hiya Ralph,
>>>> >
>>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>>> >
>>>> >> I should have looked closer to see the numbers you posted, Chris -
>>>> >> those include time for MPI wireup. So what you are seeing is that
>>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>>> >> than PMI. I suspect that PMI2 is not much better as the primary
>>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>>> >> requires that everything b
>>>>
>>>> _______________________________________________
>>>>
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/05/14716.php
>>>>
>>>
>>>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov

-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov