Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-05-07 20:15:27


Take a look in opal/mca/common/pmi - we already do a bunch of #if PMI2 stuff in there. All we are talking about doing here is:

* making those selections at runtime based on an MCA param: the PMI2 code is compiled in if PMI2 is available, but whether it is used is selected at runtime

* moving some additional functions into that code area and out of the individual components
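
A minimal sketch of how such a runtime switch could look, for illustration only. Every name here (SKETCH_HAVE_PMI2, sketch_select_version, sketch_common_pmi_init) is hypothetical and not taken from opal/mca/common/pmi. The string argument stands in for the value of the MCA param, and defaulting to PMI-1 is an assumption borrowed from the RFC subject line. The returned strings only mark where the real PMI calls would go.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: configure would define this to 1 only when the PMI-2
 * header and library were found at build time. */
#ifndef SKETCH_HAVE_PMI2
#define SKETCH_HAVE_PMI2 1
#endif

typedef enum { SKETCH_PMI1 = 1, SKETCH_PMI2 = 2 } sketch_pmi_version_t;

/* Hypothetical stand-in for reading the MCA param. PMI-2 is used only
 * when it was compiled in AND explicitly requested at runtime. */
static sketch_pmi_version_t sketch_select_version(const char *requested)
{
    if (requested != NULL && strcmp(requested, "pmi2") == 0) {
#if SKETCH_HAVE_PMI2
        return SKETCH_PMI2;
#else
        /* PMI-2 not built in: real code would emit an error message
         * here; this sketch silently falls back to PMI-1. */
        return SKETCH_PMI1;
#endif
    }
    return SKETCH_PMI1;
}

/* Common entry point: the individual components call this and never
 * test the PMI version themselves. */
static const char *sketch_common_pmi_init(sketch_pmi_version_t version)
{
    assert(version == SKETCH_PMI1 || version == SKETCH_PMI2);
#if SKETCH_HAVE_PMI2
    if (version == SKETCH_PMI2) {
        return "PMI2_Init"; /* real code would call into PMI-2 here */
    }
#endif
    return "PMI_Init";      /* real code would call into PMI-1 here */
}
```

The compile-time guard decides only what is built. The version check inside the common function is what makes the choice a runtime one.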

On May 7, 2014, at 5:08 PM, Artem Polyakov <artpol84_at_[hidden]> wrote:

> I like #2 too.
> But my question was slightly different. Can we encapsulate the PMI logic that OMPI uses in common/pmi, as #2 suggests, but have 2 different implementations of this component, say common/pmi and common/pmi2? I am asking because I have concerns that this kind of component is not supposed to be duplicated.
> In this case we could have one common MCA parameter and 2 components, as Jeff suggested.
>
>
> 2014-05-08 7:01 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:
> The desired solution is to have the ability to select pmi-1 vs pmi-2 at runtime. This can be done in two ways:
>
> 1. you could have separate pmi1 and pmi2 components in each framework. You'd want to define only one common MCA param to direct the selection, however.
>
> 2. you could have a single pmi component in each framework, calling code in the appropriate common/pmi location. You would then need a runtime MCA param to select whether pmi-1 or pmi-2 was going to be used, and have the common code check before making the desired calls.
>
> The choice of method is left up to you. They each have their negatives. If it were me, I'd probably try #2 first, assuming the codes are mostly common in the individual frameworks.
>
>
> On May 7, 2014, at 4:51 PM, Artem Polyakov <artpol84_at_[hidden]> wrote:
>
>> Just reread your suggestions in our off-list discussion and found that I misunderstood them. So no parallel PMI! Take all possible code into opal/mca/common/pmi.
>> To clarify further, which is the preferred way:
>> 1. to create one joint PMI module with switches that decide which functionality to use,
>> 2. or to have 2 separate common modules, one for PMI1 and one for PMI2? And does the latter fit the opal/mca/common/ ideology at all?
>>
>>
>> 2014-05-08 6:44 GMT+07:00 Artem Polyakov <artpol84_at_[hidden]>:
>>
>> 2014-05-08 5:54 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:
>>
>> Ummm....no, I don't think that's right. I believe we decided to instead create the separate components, default to PMI-2 if available, print nice error message if not, otherwise use PMI-1.
>>
>> I don't want to initialize both PMIs in parallel as most installations won't support it.
>>
>> Ok, I agree. Besides the lack of support, there can be a performance hit caused by PMI1 initialization at scale. This is not the case for SLURM PMI1, since it is quite simple and local, but I didn't consider other implementations.
>>
>> On May 7, 2014, at 3:49 PM, Artem Polyakov <artpol84_at_[hidden]> wrote:
>>
>>> Ralph and I discussed Joshua's concerns and decided to try automatic PMI2 correctness detection first, as was initially intended. Here is my idea. The universal way to decide whether PMI2 is correct is to compare the results of PMI_Init(.., &rank, &size, ...) and PMI2_Init(.., &rank, &size, ...). Size and rank should be equal. If they are, we proceed with PMI2 and finalize PMI1. Otherwise we finalize PMI2 and proceed with PMI1.
>>> I need to clarify with the SLURM guys whether parallel initialization of both PMIs is legal. If not, we'll do that sequentially.
>>> In other places we'll just use a flag saying which PMI version to use.
>>> Does that sound reasonable?
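
The rank/size comparison described above could be sketched roughly as follows. This is an illustration under stated assumptions, not code from OMPI or Slurm: the two PMI back ends are modeled as hypothetical callbacks (sketch_query_fn) that report rank and size, the fake_* functions are made-up stand-ins, and the real init/finalize calls are omitted. The two back ends are queried one after the other, not in parallel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int rank; int size; } sketch_job_info_t;

/* Hypothetical wrapper around one PMI flavor: fills in rank and size,
 * returns false if that flavor could not be initialized or queried. */
typedef bool (*sketch_query_fn)(sketch_job_info_t *info);

typedef enum { SKETCH_USE_PMI1, SKETCH_USE_PMI2 } sketch_choice_t;

/* Use PMI-2 only if both flavors answer and agree on rank and size;
 * in every other case fall back to PMI-1. */
static sketch_choice_t sketch_choose_pmi(sketch_query_fn pmi1,
                                         sketch_query_fn pmi2)
{
    sketch_job_info_t v1 = {0, 0};
    sketch_job_info_t v2 = {0, 0};

    assert(pmi1 != NULL && pmi2 != NULL);
    if (!pmi2(&v2) || !pmi1(&v1)) {
        return SKETCH_USE_PMI1;
    }
    return (v1.rank == v2.rank && v1.size == v2.size)
               ? SKETCH_USE_PMI2
               : SKETCH_USE_PMI1;
}

/* Fake back ends, for illustration only. */
static bool fake_pmi1(sketch_job_info_t *info)
{
    info->rank = 3; info->size = 8;
    return true;
}

static bool fake_pmi2_consistent(sketch_job_info_t *info)
{
    info->rank = 3; info->size = 8;
    return true;
}

static bool fake_pmi2_broken(sketch_job_info_t *info)
{
    info->rank = 0; info->size = 1; /* disagrees with PMI-1 */
    return true;
}
```

With these fakes, the consistent PMI-2 back end is selected and the broken one is rejected in favor of PMI-1. The result would then be stored in the flag that the rest of the code consults.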
>>>
>>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov <artpol84_at_[hidden]>:
>>> That's a good point. There are actually a bunch of modules in ompi, opal and orte that would have to be duplicated.
>>>
>>> On Wednesday, May 7, 2014, Joshua Ladd wrote:
>>> +1 Sounds like a good idea - but decoupling the two and adding all the right selection mojo might be a bit of a pain. There are several places in OMPI where the distinction between PMI1 and PMI2 is made, not only in grpcomm: the DB and ESS frameworks, off the top of my head.
>>>
>>> Josh
>>>
>>>
>>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov <artpol84_at_[hidden]> wrote:
>>> Good idea :)!
>>>
>>> On Wednesday, May 7, 2014, Ralph Castain wrote:
>>>
>>> Jeff actually had a useful suggestion (gasp!). He proposed that we separate the PMI-1 and PMI-2 codes into separate components so you could select them at runtime. Thus, we would build both (assuming both PMI-1 and PMI-2 libs are found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 component failed, we would emit a show_help indicating that they probably have a broken PMI-2 version and should try PMI-1.
>>>
>>> Make sense?
>>> Ralph
>>>
>>> On May 7, 2014, at 8:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>>
>>>> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>>>
>>>>> Ah, I see. Sorry for the reactionary comment - but this feature falls squarely within my "jurisdiction", and we've invested a lot in improving OMPI jobstart under srun.
>>>>>
>>>>> That being said (now that I've taken some deep breaths and carefully read your original email :)), what you're proposing isn't a bad idea. I think it would be good to maybe add a "--with-pmi2" flag to configure since "--with-pmi" automagically uses PMI2 if it finds the header and lib. This way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or hack the installation.
>>>>
>>>> That would be a much simpler solution than what Artem proposed (off-list) where we would try PMI2 and then if it didn't work try to figure out how to fall back to PMI1. I'll add this for now, and if Artem wants to try his more automagic solution and can make it work, then we can reconsider that option.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>>
>>>>> Josh
>>>>>
>>>>>
>>>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>> Okay, then we'll just have to develop a workaround for all those Slurm releases where PMI-2 is borked :-(
>>>>>
>>>>> FWIW: I think people misunderstood my statement. I specifically did *not* propose to *lose* PMI-2 support. I suggested that we change it to "on-by-request" instead of the current "on-by-default" so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation stabilized, then we could reverse that policy.
>>>>>
>>>>> However, given that both you and Chris appear to prefer to keep it "on-by-default", we'll see if we can find a way to detect that PMI-2 is broken and then fall back to PMI-1.
>>>>>
>>>>>
>>>>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
>>>>>
>>>>>> Just saw this thread, and I second Chris' observations: at scale we are seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT lose this functionality. For competitive reasons, I cannot provide exact numbers, but let's say the difference is in the ballpark of a full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but there is no contest between PMI1 and PMI2. We (MLNX) are actively working to resolve some of the scalability issues in PMI2.
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> Joshua S. Ladd
>>>>>> Staff Engineer, HPC Software
>>>>>> Mellanox Technologies
>>>>>>
>>>>>> Email: joshual_at_[hidden]
>>>>>>
>>>>>>
>>>>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>> Interesting - how many nodes were involved? As I said, the bad scaling becomes more evident at a fairly high node count.
>>>>>>
>>>>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <samuel_at_[hidden]> wrote:
>>>>>>
>>>>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> > Hash: SHA1
>>>>>> >
>>>>>> > Hiya Ralph,
>>>>>> >
>>>>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>>>>> >
>>>>>> >> I should have looked closer to see the numbers you posted, Chris -
>>>>>> >> those include time for MPI wireup. So what you are seeing is that
>>>>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>>>>> >> than PMI. I suspect that PMI2 is not much better as the primary
>>>>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>>>>> >> requires that everything b
>>> _______________________________________________
>>>
>>> devel mailing list
>>> devel_at_[hidden]
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/05/14716.php
>>>
>>>
>>>
>>>
>>> --
>>> Best regards, Artem Y. Polyakov