
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
From: Artem Polyakov (artpol84_at_[hidden])
Date: 2014-05-07 20:22:09


2014-05-08 7:15 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:

> Take a look in opal/mca/common/pmi - we already do a bunch of #if PMI2
> stuff in there. All we are talking about doing here is:
>
> * making those selections at runtime, based on an MCA param: the PMI2 code is
> compiled in if PMI2 is available, but selected at runtime
>
> * moving some additional functions into that code area and out of the
> individual components
>

Ok, that is pretty clear now. I will do exactly #2.
Thank you.
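
For reference, a minimal sketch of how I read option #2: one common PMI layer, with the PMI-2 path compiled in only when it is available and chosen at runtime by a single flag. Everything named here is hypothetical and only for illustration: HAVE_PMI2 stands in for whatever configure would define, the "requested" argument stands in for the value of the MCA param, and the two *_stub functions stand in for the real PMI-1 and PMI-2 init calls. It is not the actual code in opal/mca/common/pmi.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: normally defined by configure when the PMI-2 lib is found. */
#define HAVE_PMI2 1

typedef enum { PMI_VERSION_1 = 1, PMI_VERSION_2 = 2 } pmi_version_t;

/* Version the common code will use; PMI-1 unless PMI-2 is requested. */
static pmi_version_t selected_version = PMI_VERSION_1;

/* Hypothetical stand-in for the real PMI-1 initialization. */
static int pmi1_init_stub(int *rank, int *size)
{
    *rank = 0;
    *size = 1;
    return 0;
}

#if HAVE_PMI2
/* Hypothetical stand-in for the real PMI-2 initialization. */
static int pmi2_init_stub(int *rank, int *size)
{
    *rank = 0;
    *size = 1;
    return 0;
}
#endif

/* Record the runtime choice. "requested" models the MCA param value.
 * Anything other than 2, or 2 without PMI-2 compiled in, means PMI-1. */
pmi_version_t common_pmi_select(int requested)
{
    selected_version = PMI_VERSION_1;
#if HAVE_PMI2
    if (2 == requested) {
        selected_version = PMI_VERSION_2;
    }
#endif
    return selected_version;
}

/* Single entry point for the components: check the flag, then dispatch. */
int common_pmi_init(int *rank, int *size)
{
    assert(NULL != rank && NULL != size);
#if HAVE_PMI2
    if (PMI_VERSION_2 == selected_version) {
        return pmi2_init_stub(rank, size);
    }
#endif
    return pmi1_init_stub(rank, size);
}
```

A component would call common_pmi_select() once with the param value and then only go through common_pmi_init() and similar wrappers, so the individual frameworks never need their own PMI1/PMI2 checks.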

>
>
> On May 7, 2014, at 5:08 PM, Artem Polyakov <artpol84_at_[hidden]> wrote:
>
> I like #2 too.
> But my question was slightly different. Can we encapsulate the PMI logic that
> OMPI uses in common/pmi, as #2 suggests, but have 2 different
> implementations of this component, say common/pmi and common/pmi2? I am
> asking because I have concerns that this kind of component is not supposed
> to be duplicated.
> In this case we could have one common MCA parameter and 2 components as it
> was suggested by Jeff.
>
>
> 2014-05-08 7:01 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:
>
>> The desired solution is to have the ability to select pmi-1 vs pmi-2 at
>> runtime. This can be done in two ways:
>>
>> 1. you could have separate pmi1 and pmi2 components in each framework.
>> You'd want to define only one common MCA param to direct the selection,
>> however.
>>
>> 2. you could have a single pmi component in each framework, calling code
>> in the appropriate common/pmi location. You would then need a runtime MCA
>> param to select whether pmi-1 or pmi-2 was going to be used, and have the
>> common code check before making the desired calls.
>>
>> The choice of method is left up to you. They each have their negatives.
>> If it were me, I'd probably try #2 first, assuming the codes are mostly
>> common in the individual frameworks.
>>
>>
>> On May 7, 2014, at 4:51 PM, Artem Polyakov <artpol84_at_[hidden]> wrote:
>>
>> Just reread your suggestions in our out-of-list discussion and found
>> that I misunderstood them. So no parallel PMI! Take all possible code into
>> opal/mca/common/pmi.
>> To additionally clarify, which is the preferred way:
>> 1. to create one joined PMI module having switches to decide what
>> functionality to implement,
>> 2. or to have 2 separate common modules, one for PMI1 and one for PMI2? And
>> does this fit the opal/mca/common/ ideology at all?
>>
>>
>> 2014-05-08 6:44 GMT+07:00 Artem Polyakov <artpol84_at_[hidden]>:
>>
>>>
>>> 2014-05-08 5:54 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:
>>>
>>> Ummm....no, I don't think that's right. I believe we decided to instead
>>>> create the separate components, default to PMI-2 if available, print a nice
>>>> error message if not, otherwise use PMI-1.
>>>>
>>>> I don't want to initialize both PMIs in parallel as most installations
>>>> won't support it.
>>>>
>>>
>>> Ok, I agree. Besides the lack of support, there can be a performance hit
>>> caused by PMI1 initialization at scale. This is not the case for SLURM PMI1,
>>> since it is quite simple and local. But I didn't consider other
>>> implementations.
>>>
>>> On May 7, 2014, at 3:49 PM, Artem Polyakov <artpol84_at_[hidden]> wrote:
>>>>
>>>> Ralph and I discussed Joshua's concerns and decided to try automatic
>>>> detection of PMI2 correctness first, as was initially intended. Here is my
>>>> idea. The universal way to decide if PMI2 is correct is to compare the
>>>> results of PMI_Init(.., &rank, &size, ...) and PMI2_Init(.., &rank, &size,
>>>> ...). Size and rank should be equal. In this case we finalize PMI1 and
>>>> proceed with PMI2. Otherwise we finalize PMI2 and proceed with PMI1.
>>>> I need to clarify with the SLURM guys whether parallel initialization of
>>>> both PMIs is legal. If not, we'll do that sequentially.
>>>> In other places we'll just use the flag saying what PMI version to use.
>>>> Does that sound reasonable?
>>>>
>>>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov <artpol84_at_[hidden]>:
>>>>
>>>>> That's a good point. There are actually a bunch of modules in ompi,
>>>>> opal and orte that have to be duplicated.
>>>>>
>>>>> On Wednesday, May 7, 2014, Joshua Ladd wrote:
>>>>>
>>>>>> +1 Sounds like a good idea - but decoupling the two and adding all
>>>>>> the right selection mojo might be a bit of a pain. There are several places
>>>>>> in OMPI where the distinction between PMI1 and PMI2 is made, not only in
>>>>>> grpcomm. DB and ESS frameworks off the top of my head.
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov <artpol84_at_[hidden]>
>>>>>> wrote:
>>>>>>
>>>>>>> Good idea :)!
>>>>>>>
>>>>>>> On Wednesday, May 7, 2014, Ralph Castain wrote:
>>>>>>>
>>>>>>> Jeff actually had a useful suggestion (gasp!). He proposed that we
>>>>>>> separate the PMI-1 and PMI-2 codes into separate components so you could
>>>>>>> select them at runtime. Thus, we would build both (assuming both PMI-1 and
>>>>>>> 2 libs are found), default to PMI-1, but users could select to try PMI-2.
>>>>>>> If the PMI-2 component failed, we would emit a show_help indicating that
>>>>>>> they probably have a broken PMI-2 version and should try PMI-1.
>>>>>>>
>>>>>>> Make sense?
>>>>>>> Ralph
>>>>>>>
>>>>>>> On May 7, 2014, at 8:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Ah, I see. Sorry for the reactionary comment - but this feature
>>>>>>> falls squarely within my "jurisdiction", and we've invested a lot in
>>>>>>> improving OMPI jobstart under srun.
>>>>>>>
>>>>>>> That being said (now that I've taken some deep breaths and carefully
>>>>>>> read your original email :)), what you're proposing isn't a bad idea. I
>>>>>>> think it would be good to maybe add a "--with-pmi2" flag to configure since
>>>>>>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
>>>>>>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
>>>>>>> hack the installation.
>>>>>>>
>>>>>>>
>>>>>>> That would be a much simpler solution than what Artem proposed
>>>>>>> (off-list) where we would try PMI2 and then if it didn't work try to figure
>>>>>>> out how to fall back to PMI1. I'll add this for now, and if Artem wants to
>>>>>>> try his more automagic solution and can make it work, then we can
>>>>>>> reconsider that option.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <rhc_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Okay, then we'll just have to develop a workaround for all those
>>>>>>> Slurm releases where PMI-2 is borked :-(
>>>>>>>
>>>>>>> FWIW: I think people misunderstood my statement. I specifically did
>>>>>>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>>>>>>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>>>>>>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>>>>>>> stabilized, then we could reverse that policy.
>>>>>>>
>>>>>>> However, given that both you and Chris appear to prefer to keep it
>>>>>>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>>>>>>> broken and then fall back to PMI-1.
>>>>>>>
>>>>>>>
>>>>>>> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.mlnx_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Just saw this thread, and I second Chris' observations: at scale we
>>>>>>> are seeing huge gains in jobstart performance with PMI2 over PMI1. We
>>>>>>> *CANNOT* lose this functionality. For competitive reasons, I
>>>>>>> cannot provide exact numbers, but let's say the difference is in the
>>>>>>> ballpark of a full order-of-magnitude on 20K ranks versus PMI1. PMI1 is
>>>>>>> completely unacceptable/unusable at scale. Certainly PMI2 still has scaling
>>>>>>> issues, but there is no contest between PMI1 and PMI2. We (MLNX) are
>>>>>>> actively working to resolve some of the scalability issues in PMI2.
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> Joshua S. Ladd
>>>>>>> Staff Engineer, HPC Software
>>>>>>> Mellanox Technologies
>>>>>>>
>>>>>>> Email: joshual_at_[hidden]
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <rhc_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Interesting - how many nodes were involved? As I said, the bad
>>>>>>> scaling becomes more evident at a fairly high node count.
>>>>>>>
>>>>>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <
>>>>>>> samuel_at_[hidden]> wrote:
>>>>>>>
>>>>>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> > Hash: SHA1
>>>>>>> >
>>>>>>> > Hiya Ralph,
>>>>>>> >
>>>>>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>>>>>> >
>>>>>>> >> I should have looked closer to see the numbers you posted, Chris -
>>>>>>> >> those include time for MPI wireup. So what you are seeing is that
>>>>>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>>>>>> >> than PMI. I suspect that PMI2 is not much better as the primary
>>>>>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>>>>>> >> requires that everything b
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>>
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/05/14716.php
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Best regards, Artem Y. Polyakov
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards, Artem Y. Polyakov
>>>
>>
>>
>>
>> --
>> Best regards, Artem Y. Polyakov
>>
>>
>>
>>
>
>
>
> --
> Best regards, Artem Y. Polyakov
>
>
>
>

-- 
Best regards, Artem Y. Polyakov