Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-05-06 15:00:25


On May 5, 2010, at 7:54 PM, Douglas Guptill wrote:

> P.S. Yes, I know OpenMPI 1.2.8 is old. We have been using it for 2
> years with no apparent problems.

It ain't broke; don't fix it -- nothing wrong with that.

> When I saw comments like "machine hung" for 1.4.1,

FWIW, I find it hard to believe that Open MPI is the cause of machine hangs. Open MPI is user-level process stuff, which should generally not be able to crash Linux. If user-level processes can hang Linux, then something else is probably broken.

But also FWIW, we have found that various MPI benchmarks and test applications can be *excellent* at finding underlying server / network problems. I can't think of a case offhand where Open MPI "caused" a machine to hang/crash/die/whatever that wasn't ultimately tracked down to some other root cause.

> and "data loss" for 1.3.x, I put aside thoughts of upgrading.

We definitely did have a big problem with OpenFabrics registered memory in Open MPI 1.3.0 and 1.3.1 (corrected in 1.3.2). Shame on us. :-(

But to continue the "FWIW" from above: we actually do *millions* of regression tests before Open MPI is released -- literally. All of us were convinced that 1.3.0 and 1.3.1 were ok to release; the data corruption issues caught us by surprise. Yuck. Those kinds of bugs are the worst. :-(

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/