2014-05-08 9:54 GMT+07:00 Ralph Castain <rhc_at_[hidden]>:
> On May 7, 2014, at 6:15 PM, Christopher Samuel <samuel_at_[hidden]> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> > Hi all,
> > Apologies for having dropped out of the thread, night intervened here.
> > On 08/05/14 00:45, Ralph Castain wrote:
> >> Okay, then we'll just have to develop a workaround for all those
> >> Slurm releases where PMI-2 is borked :-(
> > Do you know what these releases are? Are we talking 2.6.x or 14.03?
> > The 14.03 series has had a fair few rapid point releases and doesn't
> > appear to be anywhere near as stable as 2.6 was when it came out. :-(
> Yeah :-(
> I think there was one 2.6.x that was borked, and definitely problems in
> the 14.03.x line. Can't pinpoint it for you, though.
The bug I experienced with abnormal OMPI termination persists from Slurm
2.6.3 through the latest Slurm release. It may appear earlier; I didn't check.
However, the Slurm guys haven't actually confirmed that it's a bug. Things
should become clear in about 2 weeks, when the person who maintains the code
reviews the patch. But I am pretty sure that's a bug.
Refer to this thread
> >> FWIW: I think people misunderstood my statement. I specifically
> >> did *not* propose to *lose* PMI-2 support. I suggested that we
> >> change it to "on-by-request" instead of the current "on-by-default"
> >> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> >> the Slurm implementation stabilized, then we could reverse that
> >> policy.
> >> However, given that both you and Chris appear to prefer to keep it
> >> "on-by-default", we'll see if we can find a way to detect that
> >> PMI-2 is broken and then fall back to PMI-1.
> > My intention was to provide the data that led us to want PMI2, but if
> > configure had an option to enable PMI2, leaving it off by default so that
> > only those who requested it got it, then I'd be more than happy - we'd just add it
> > to our script to build it.
> Sounds good. I'm going to have to dig deeper into those numbers, though,
> as they don't entirely add up to me. Once the job gets launched, the launch
> method itself should have no bearing on computational speed - IF all things
> are equal. In other words, if the process layout is the same, and the
> binding pattern is the same, then computational speed should be roughly
> equivalent regardless of how the procs were started.
> My guess is that your data might indicate a difference in the layout
> and/or binding pattern as opposed to PMI2 vs mpirun. At the scale you
> mention later in the thread (only 70 nodes x 16 ppn), the difference in
> launch timing would be zilch. So I'm betting you would find (upon further
> exploration) that (a) you might not have been binding processes when
> launching by mpirun, since we didn't bind by default until the 1.8 series,
> but were binding under direct srun launch, and (b) your process mapping
> would quite likely be different as we default to byslot mapping, and I
> believe srun defaults to bynode?
> Might be worth another comparison run when someone has time.
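To make the byslot vs. bynode distinction above concrete, here is a minimal Python sketch of the two placement policies. It is only an illustration under simplifying assumptions (identical nodes, a fixed number of slots per node); the function names are hypothetical and it does not call into Open MPI or Slurm, nor does it claim which policy either launcher actually uses by default.

```python
def map_by_slot(num_ranks, num_nodes, slots_per_node):
    """Fill every slot on a node before moving on to the next node."""
    if num_ranks > num_nodes * slots_per_node:
        raise ValueError("more ranks than available slots")
    # Consecutive ranks land on the same node until it is full.
    return [rank // slots_per_node for rank in range(num_ranks)]


def map_by_node(num_ranks, num_nodes, slots_per_node):
    """Deal ranks out round-robin, one per node per pass."""
    if num_ranks > num_nodes * slots_per_node:
        raise ValueError("more ranks than available slots")
    # Consecutive ranks land on different nodes.
    return [rank % num_nodes for rank in range(num_ranks)]


if __name__ == "__main__":
    # Toy case: 8 ranks on 2 nodes with 4 slots each.
    print("byslot:", map_by_slot(8, 2, 4))  # [0, 0, 0, 0, 1, 1, 1, 1]
    print("bynode:", map_by_node(8, 2, 4))  # [0, 1, 0, 1, 0, 1, 0, 1]
```

With byslot, neighbouring ranks share a node, so nearest-neighbour communication tends to stay on-node; with bynode, neighbouring ranks sit on different nodes. That alone can change the measured run time of the same binary, independent of how the job was launched.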
> > All the best!
> > Chris
> > - --
> > Christopher Samuel Senior Systems Administrator
> > VLSCI - Victorian Life Sciences Computation Initiative
> > Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
> > http://www.vlsci.org.au/ http://twitter.com/vlsci
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.14 (GNU/Linux)
> > Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> > iEYEARECAAYFAlNq2poACgkQO2KABBYQAh+7DwCfeahirvoQ9Wom4VNhJIIdufeP
> > 7uIAnAruTnXZBn6HXhuMAlzzSsoKkXlt
> > =OvH4
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
Best regards, Artem Y. Polyakov