Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-05-24 08:14:57


Paul --

Many thanks for your detailed report. I apparently missed a whole
boatload of e-mails on 2 May due to a problem with my mail client. Deep
apologies for missing this mail! :-(

More information below.

> -----Original Message-----
> From: devel-bounces_at_[hidden]
> [mailto:devel-bounces_at_[hidden]] On Behalf Of Paul Donohue
> Sent: Friday, May 05, 2006 10:47 PM
> To: devel_at_[hidden]
> Subject: [OMPI devel] Oversubscription/Scheduling Bug
>
> I would like to be able to start a non-oversubscribed run of
> a program in OpenMPI as if it were oversubscribed, so that
> the processes run in Degraded Mode, such that I have the
> option to start an additional simultaneous run on the same
> nodes if necessary.
> (Basically, I have a program that will ask for some data, run
> for a while, then print some results, then stop and ask for
> more data. It takes some time to collect and input the
> additional data, so I would like to be able to start another
> instance of the program which can be running while I'm
> inputting data to the first instance, and can be inputting
> while the first instance is running).
>
> Since I have single-processor nodes, the obvious solution
> would be to set slots=0 for each of my nodes, so that using 1
> slot for every run causes the nodes to be oversubscribed.
> However, it seems that slots=0 is treated like
> slots=infinity, so my processes run in Aggressive Mode, and I
> lose the ability to oversubscribe my node using two
> independent processes.

I'd prefer to keep slots=0 synonymous with "infinity", if only for
historical reasons (it's also less code to change :-) ).
 
> So, I tried setting '--mca mpi_yield_when_idle 1', since this
> sounded like it was meant to force Degraded Mode. But, it
> didn't seem to do anything - my processes still ran in
> Aggressive Mode. I skimmed through the source code real
> quick, and it doesn't look like mpi_yield_when_idle is ever
> actually used.

Are you sure? How did you test this?

I just did a few tests and it seems to work fine for me. The MCA param
"mpi_yield_when_idle" is actually used within the OPAL layer, in
opal/runtime/opal_progress.c. (The name is somewhat of an abstraction
break: it reflects the fact that the progression engine used to be up in
the MPI layer; it was moved into OPAL when the entire source code tree
was split into OPAL, ORTE, and OMPI.)

You can check whether this param is set by using the
mpi_show_mca_params MCA parameter. Setting this parameter to 1 will
make all MPI processes display the current values of their MCA
parameters on stderr. For example:

-----
shell% mpirun -np 1 --mca mpi_show_mca_params 1 hello |& grep yield
[foo.example.com:23206] mpi_yield_when_idle=0
shell% mpirun -np 1 --mca mpi_yield_when_idle 1 \
    --mca mpi_show_mca_params 1 hello |& grep yield
[foo.example.com:23213] mpi_yield_when_idle=1
-----

It may be difficult to tell if this behavior is working properly
because, by definition, if you're in an oversubscribed situation
(assuming that all your processes are trying to fully utilize the CPU),
the entire system could be running pretty slowly anyway.

The difference between aggressive and degraded mode is that, in degraded
mode, we call yield() in the middle of tight progression loops in OMPI.
Hence, if you're oversubscribed, this actually gives other processes a
chance of being scheduled / run by the OS. For example, if you
oversubscribe and don't have this param set, because OMPI uses tight
repetitive loops for progression, you will typically see one process
completely hogging the CPU for a long, long time before the OS finally
lets another be scheduled.
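
To make the idea concrete, here is a rough sketch of a progression loop
that yields when it finds nothing to do. This is not the actual code in
opal_progress.c; the names yield_when_idle, poll_for_events,
progress_once, and wait_for_events are hypothetical, and the use of
POSIX sched_yield() for the yield call is an assumption made for the
sake of the example:

```c
#include <assert.h>
#include <sched.h>     /* sched_yield() (POSIX); assumed yield primitive */
#include <stdbool.h>

/* Hypothetical stand-in for the mpi_yield_when_idle MCA parameter. */
bool yield_when_idle = true;

/* Hypothetical event source: reports one completed event on every
 * 1000th poll and nothing on all other polls. */
int poll_for_events(void)
{
    static unsigned long calls = 0;
    return (++calls % 1000 == 0) ? 1 : 0;
}

/* One pass of a simplified progress function. If nothing completed and
 * yielding is enabled, give up the CPU so another runnable process can
 * be scheduled. Returns the number of events seen on this pass. */
int progress_once(void)
{
    int events = poll_for_events();
    if (events == 0 && yield_when_idle) {
        sched_yield();
    }
    return events;
}

/* Tight loop: keep calling the progress function until `wanted` events
 * have completed. Returns the number of passes the loop made. */
unsigned long wait_for_events(int wanted)
{
    unsigned long passes = 0;
    int done = 0;
    while (done < wanted) {
        done += progress_once();
        ++passes;
    }
    return passes;
}
```

With yield_when_idle set to false, the loop makes exactly the same
number of passes but never gives up the CPU voluntarily, which is the
"hogging" behavior described above.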

I just did a small test: running 3 processes on a 2-way SMP. Each MPI
process sends a short message around in a ring pattern 100 times:

- mpi_yield_when_idle=1 : 1.4 seconds running time
- mpi_yield_when_idle=0 : 22.8 seconds running time

So it can make a big difference. But don't expect it to completely
mitigate the effects of oversubscription.

> I also noticed another bug in the scheduler:
> hostfile:
> A slots=2 max-slots=2
> B slots=2 max-slots=2
> 'mpirun -np 5' quits with an over-subscription error
> 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever

Yoinks; this is definitely a bug. I've filed a bug in our tracker to
get this fixed. Thanks for reporting it.
 
> And finally, on http://www.open-mpi.org/faq/?category=tuning
> - 11. How do I tell Open MPI to use processor and/or memory affinity?
> It mentions that OpenMPI will automatically disable processor
> affinity on oversubscribed nodes. When I first read it, I

Correct.

> made the assumption that processor affinity and Degraded Mode
> were incompatible. However, it seems that independent
> non-oversubscribed processes running in Degraded Mode work
> fine with processor affinity - it's only actually
> oversubscribed processes which have problems. A note that
> Degraded Mode and Processor Affinity work together even
> though Processor Affinity and oversubscription do not would be nice.

Good point. I'll update the FAQ later today; thanks!

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems