Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] perhaps an openmpi bug, how best to identify?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-07-07 10:29:31


On Jul 7, 2010, at 10:20 AM, Olivier Marsden wrote:

> The (7 process) code runs correctly on my workstation using mpich2
> (latest stable version) & ifort 11.1, using intel-mpi & ifort 11.1,
> but randomly hangs the computer (vanilla ubuntu 9.10 kernel v. 2.6.31)
> to the point where only a magic sysrq combination can "save" me
> (i.e. reboot), when using
> - openmpi 1.4.2 compiled from source with gcc, ifort for mpif90
> - clustertools v. 8.2.1c distribution from sun/oracle, also based on
>   openmpi 1.4.2, using sun f90 for mpif90

Yowza. Open MPI is user-space code, so it should never be able to hang the entire computer. Open MPI and MPICH2 do implement things in very different ways, so it's quite possible that we trip entirely different code paths in the same Linux kernel.

Never say "never" -- it could well be an Open MPI bug. But it smells like a kernel bug...

> I am prepared to do some testing if that can help, but don't know the
> best way to identify what's going on.
> I have found no useful information in the syslog files.

Is the machine totally hung? Or is it just running really, really slowly? Try leaving some kind of low-frequency monitoring process running in the background and see if it keeps running (perhaps even more slowly than before) when the machine hangs. E.g., something like a shell script that loops, sleeping for a second and then appending the output of "date" to a file. Or something like that.
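
A minimal sketch of such a heartbeat, written in Python rather than shell; the function name, the log path, and the fsync call are my own assumptions, not anything Open MPI provides. The fsync is there so that the last lines have a chance of surviving a sysrq reboot:

```python
import os
import time
from datetime import datetime


def heartbeat(path, interval=1.0, count=None):
    """Append a timestamp to `path` every `interval` seconds.

    count=None loops until interrupted; a number stops after that many lines.
    Returns the number of lines written.
    """
    written = 0
    while count is None or written < count:
        # Reopen the file on every pass and force it to disk, so the last
        # timestamp is not lost in a buffer if the machine locks up.
        with open(path, "a") as log:
            log.write(datetime.now().isoformat(timespec="seconds") + "\n")
            log.flush()
            os.fsync(log.fileno())
        written += 1
        time.sleep(interval)
    return written


# Hypothetical usage, started before launching the MPI job:
#     heartbeat("/tmp/heartbeat.log")
```

After the next hang, look at the file: timestamps that keep arriving but with gaps much longer than a second suggest the machine was starved of cycles, while a file that simply stops suggests a real hang.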

My point: see if Open MPI went into some hyper-aggressive mode where it's (literally) stealing every available cycle and making the machine look hung. You might even want to try running the OMPI procs at a low priority to see if it can help alleviate the "steal all cycles" mode (if that is, indeed, what is happening).
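
One way to try the low-priority run, as a sketch only: the helper name is made up, and it assumes a POSIX system with all the processes started on the local workstation, where children inherit the nice value of whatever launched them:

```python
import os
import subprocess


def run_at_low_priority(cmd, niceness=19):
    """Run `cmd` (a list of arguments) with its nice value raised by `niceness`.

    Child processes inherit the nice value, so anything the command starts
    on the local machine runs at the same low priority.
    Returns the exit code of the command.
    """
    # os.nice() is called in the child just before it execs the command.
    result = subprocess.run(cmd, preexec_fn=lambda: os.nice(niceness))
    return result.returncode


# Hypothetical usage (the program name is a placeholder):
#     run_at_low_priority(["mpirun", "-np", "7", "./my_app"])
```

If the machine stays responsive with the job niced down, that points at the "steal all cycles" explanation rather than a true hang.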

If the machine is truly hung, then something else might be going on. Do any kernel logs report anything? Can you crank up your syslog to report *all* events, for example?

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/