Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
From: Artem Polyakov (artpol84_at_[hidden])
Date: 2014-05-07 23:05:30


2014-05-08 9:54 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:

>
> On May 7, 2014, at 6:15 PM, Christopher Samuel <samuel_at_[hidden]>
> wrote:
>
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Hi all,
> >
> > Apologies for having dropped out of the thread, night intervened here. ;-)
> >
> > On 08/05/14 00:45, Ralph Castain wrote:
> >
> >> Okay, then we'll just have to develop a workaround for all those
> >> Slurm releases where PMI-2 is borked :-(
> >
> > Do you know what these releases are? Are we talking 2.6.x or 14.03?
> > The 14.03 series has had a fair few rapid point releases and doesn't
> > appear to be anywhere as near as stable as 2.6 was when it came out. :-(
>
> Yeah :-(
>
> I think there was one 2.6.x that was borked, and definitely problems in
> the 14.03.x line. Can't pinpoint it for you, though.
>

The bug I experienced with abnormal OMPI termination persists from 2.6.3
through the latest Slurm release. It may appear earlier; I didn't check.
However, the Slurm guys haven't actually confirmed that it's a bug. Things
will become clearer in about 2 weeks, when the person who maintains the code
reviews the patch. But I am pretty sure that's a bug.

See this thread:
http://thread.gmane.org/gmane.comp.distributed.slurm.devel/5213
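
To make the affected range concrete, here is a minimal sketch of a version
check. It is only an illustration: the helper names are hypothetical, the
input is assumed to be a string containing a dotted three-part version such
as "slurm 14.03.1", and the lower bound of 2.6.3 is simply the earliest
release where I saw the problem, not a confirmed boundary.

```python
import re

# Assumption: 2.6.3 is the earliest release where the abnormal termination
# was observed. Earlier releases were not checked, and no fixed release is
# known, so there is no upper bound.
FIRST_OBSERVED = (2, 6, 3)


def parse_slurm_version(text):
    """Extract (major, minor, micro) from a string such as 'slurm 14.03.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version found in %r" % (text,))
    return tuple(int(part) for part in match.groups())


def pmi2_termination_bug_suspected(version_text):
    """Return True if the release is at or after the first observed one."""
    return parse_slurm_version(version_text) >= FIRST_OBSERVED
```

Comparing tuples of integers avoids the trap of comparing "14.03" and "2.6"
as strings, where "14" would sort before "2".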

>
> >
> >> FWIW: I think people misunderstood my statement. I specifically
> >> did *not* propose to *lose* PMI-2 support. I suggested that we
> >> change it to "on-by-request" instead of the current "on-by-default"
> >> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> >> the Slurm implementation stabilized, then we could reverse that
> >> policy.
> >>
> >> However, given that both you and Chris appear to prefer to keep it
> >> "on-by-default", we'll see if we can find a way to detect that
> >> PMI-2 is broken and then fall back to PMI-1.
> >
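
The two policies quoted above differ only in what happens when the user
expresses no preference. A minimal sketch of that decision follows. The
function and parameter names are hypothetical, `pmi2_usable` stands in for
whatever detection of a broken PMI-2 is eventually found, and honoring an
explicit request even when the probe fails is my assumption, not something
settled in this thread.

```python
def choose_pmi(requested=None, pmi2_usable=True, on_by_default=True):
    """Pick the PMI version (1 or 2) to use under Slurm.

    requested     -- None, 1 or 2; an explicit user choice always wins
                     (assumption, see lead-in)
    pmi2_usable   -- result of a hypothetical "is PMI-2 broken?" probe
    on_by_default -- True for "on-by-default", False for "on-by-request"
    """
    if requested in (1, 2):
        return requested
    if on_by_default and pmi2_usable:
        return 2
    # Either the policy is "on-by-request" and nothing was requested,
    # or PMI-2 looks broken: fall back to PMI-1.
    return 1
```

With `on_by_default=False` the probe is never consulted unless the user asks
for PMI-2, which is the "on-by-request" behaviour. With `on_by_default=True`
the probe decides, which is the fallback plan.
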
> > My intention was to provide the data that led us to want PMI2, but if
> > configure had an option to enable PMI2 by default so that only those
> > who requested it got it then I'd be more than happy - we'd just add it
> > to our script to build it.
>
> Sounds good. I'm going to have to dig deeper into those numbers, though,
> as they don't entirely add up to me. Once the job gets launched, the launch
> method itself should have no bearing on computational speed - IF all things
> are equal. In other words, if the process layout is the same, and the
> binding pattern is the same, then computational speed should be roughly
> equivalent regardless of how the procs were started.
>
> My guess is that your data might indicate a difference in the layout
> and/or binding pattern as opposed to PMI2 vs mpirun. At the scale you
> mention later in the thread (only 70 nodes x 16 ppn), the difference in
> launch timing would be zilch. So I'm betting you would find (upon further
> exploration) that (a) you might not have been binding processes when
> launching by mpirun, since we didn't bind by default until the 1.8 series,
> but were binding under direct srun launch, and (b) your process mapping
> would quite likely be different as we default to byslot mapping, and I
> believe srun defaults to bynode?
>
> Might be worth another comparison run when someone has time.
>
>
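
To illustrate why the mapping alone can change the layout, here is a small
sketch of the two patterns named above. It is a simplified model with
hypothetical function names, not Open MPI's or Slurm's mapper, and it does
not settle which default srun actually uses.

```python
def _check_capacity(num_ranks, num_nodes, slots_per_node):
    if num_ranks > num_nodes * slots_per_node:
        raise ValueError("more ranks than available slots")


def map_byslot(num_ranks, num_nodes, slots_per_node):
    """Fill every slot on a node before moving on to the next node."""
    _check_capacity(num_ranks, num_nodes, slots_per_node)
    return [rank // slots_per_node for rank in range(num_ranks)]


def map_bynode(num_ranks, num_nodes, slots_per_node):
    """Place consecutive ranks on consecutive nodes, round-robin."""
    _check_capacity(num_ranks, num_nodes, slots_per_node)
    return [rank % num_nodes for rank in range(num_ranks)]


# 8 ranks on 2 nodes with 4 slots each; each entry is the node of that rank.
print(map_byslot(8, 2, 4))  # [0, 0, 0, 0, 1, 1, 1, 1]
print(map_bynode(8, 2, 4))  # [0, 1, 0, 1, 0, 1, 0, 1]
```

Under byslot, neighbouring ranks share a node. Under bynode, they sit on
different nodes, so an application that mostly talks to neighbouring ranks
sees very different communication costs even though the launch method has
nothing to do with it.
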
> >
> > All the best!
> > Chris
> > - --
> > Christopher Samuel Senior Systems Administrator
> > VLSCI - Victorian Life Sciences Computation Initiative
> > Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
> > http://www.vlsci.org.au/ http://twitter.com/vlsci
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.14 (GNU/Linux)
> > Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> >
> > iEYEARECAAYFAlNq2poACgkQO2KABBYQAh+7DwCfeahirvoQ9Wom4VNhJIIdufeP
> > 7uIAnAruTnXZBn6HXhuMAlzzSsoKkXlt
> > =OvH4
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/05/14733.php
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14738.php
>

-- 
Best regards, Artem Y. Polyakov