Good idea :)!
On Wednesday, May 7, 2014, Ralph Castain wrote:
> Jeff actually had a useful suggestion (gasp!). He proposed that we separate
> the PMI-1 and PMI-2 codes into separate components so you could select them
> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are
> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2
> component failed, we would emit a show_help indicating that they probably
> have a broken PMI-2 version and should try PMI-1.
> Make sense?
> On May 7, 2014, at 8:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> On May 7, 2014, at 7:56 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
> Ah, I see. Sorry for the reactionary comment - but this feature falls
> squarely within my "jurisdiction", and we've invested a lot in improving
> OMPI jobstart under srun.
> That being said (now that I've taken some deep breaths and carefully read
> your original email :)), what you're proposing isn't a bad idea. I think it
> would be good to maybe add a "--with-pmi2" flag to configure since
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
> hack the installation.
> That would be a much simpler solution than what Artem proposed (off-list)
> where we would try PMI2 and then if it didn't work try to figure out how to
> fall back to PMI1. I'll add this for now, and if Artem wants to try his
> more automagic solution and can make it work, then we can reconsider that
> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> Okay, then we'll just have to develop a workaround for all those Slurm
> releases where PMI-2 is borked :-(
> FWIW: I think people misunderstood my statement. I specifically did *not*
> propose to *lose* PMI-2 support. I suggested that we change it to
> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
> stabilized, then we could reverse that policy.
> However, given that both you and Chris appear to prefer to keep it
> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
> broken and then fall back to PMI-1.
> On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.mlnx_at_[hidden]> wrote:
> Just saw this thread, and I second Chris' observations: at scale we are
> seeing huge gains in jobstart performance with PMI2 over PMI1. We *CANNOT* lose this functionality. For competitive reasons, I cannot provide exact
> numbers, but let's say the difference is in the ballpark of a full
> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
> but there is no contest between PMI1 and PMI2. We (MLNX) are actively
> working to resolve some of the scalability issues in PMI2.
> Joshua S. Ladd
> Staff Engineer, HPC Software
> Mellanox Technologies
> Email: joshual_at_[hidden]
> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> Interesting - how many nodes were involved? As I said, the bad scaling
> becomes more evident at a fairly high node count.
> On May 7, 2014, at 12:07 AM, Christopher Samuel <samuel_at_[hidden]> wrote:
> > Hiya Ralph,
> > On 07/05/14 14:49, Ralph Castain wrote:
> >> I should have looked closer to see the numbers you posted, Chris -
> >> those include time for MPI wireup. So what you are seeing is that
> >> mpirun is much more efficient at exchanging the MPI endpoint info
> >> than PMI. I suspect that PMI2 is not much better as the primary
> >> reason for the difference is that mpirun sends blobs, while PMI
> >> requires that everything b
Best regards, Artem Y. Polyakov