Open MPI Development Mailing List Archives

From: Paul Donohue (openmpi_at_[hidden])
Date: 2006-05-24 10:27:13

> > Since I have single-processor nodes, the obvious solution
> > would be to set slots=0 for each of my nodes, so that using 1
> > slot for every run causes the nodes to be oversubscribed.
> > However, it seems that slots=0 is treated like
> > slots=infinity, so my processes run in Aggressive Mode, and I
> > lose the ability to oversubscribe my node using two
> > independent processes.
> I'd prefer to keep the slots=0 synonymous to "infinity", if only for
> historical reasons (it's also less code to change :-) ).
Understandable. Mapping 'slots=0' to 'infinity' is a useful feature, I think. I only mentioned it to justify why mpi_yield_when_idle needs to work properly: its functionality cannot be duplicated by mucking with the slots value.

> > So, I tried setting '--mca mpi_yield_when_idle 1', since this
> > sounded like it was meant to force Degraded Mode. But, it
> > didn't seem to do anything - my processes still ran in
> > Aggressive Mode. I skimmed through the source code real
> > quick, and it doesn't look like mpi_yield_when_idle is ever
> > actually used.
> Are you sure? How did you test this?

I'm using Open MPI 1.0.2 (in case it makes a difference).

$ mpirun -np 2 --hostfile test --host --mca mpi_yield_when_idle 1 --mca orte_debug 1 hostname 2>&1 | grep yield
[psd:30325] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe paul_at_psd:default-universe-30325 --nsreplica "0.0.0;tcp://" --gprreplica "0.0.0;tcp://" --mpi-call-yield 0
[psd:30325] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[psd:30325] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename --universe paul_at_psd:default-universe-30325 --nsreplica "0.0.0;tcp://" --gprreplica "0.0.0;tcp://" --mpi-call-yield 0

When it launches the worker processes, it passes --mpi-call-yield 0 to them even though I set mpi_yield_when_idle to 1.

Perhaps this has something to do with it:
(lines 689-703 of orte/mca/pls/rsh/pls_rsh_module.c)
                /* set the progress engine schedule for this node.
                 * if node_slots is set to zero, then we default to
                 * NOT being oversubscribed
                 */
                if (ras_node->node_slots > 0 &&
                    opal_list_get_size(&rmaps_node->node_procs) > ras_node->node_slots) {
                    if (mca_pls_rsh_component.debug) {
                        opal_output(0, "pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (%d %d)",
                                    ras_node->node_slots, opal_list_get_size(&rmaps_node->node_procs));
                    }
                    argv[call_yield_index] = strdup("1");
                } else {
                    if (mca_pls_rsh_component.debug) {
                        opal_output(0, "pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0");
                    }
                    argv[call_yield_index] = strdup("0");
                }

It looks like mpi_yield_when_idle is ignored and only slots are taken into account...
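
For illustration, here is a minimal sketch of how the decision could honor an explicit user setting before falling back to the slot count. This is not the actual Open MPI code or its eventual fix. The function name choose_call_yield and the convention that a negative user_yield means "not set by the user" are assumptions made only for this example.

```c
#include <stddef.h>

/* Hypothetical helper: pick the value to pass as --mpi-call-yield.
 * user_yield < 0 means the user did not set mpi_yield_when_idle
 * (an assumed convention for this sketch).
 * node_slots == 0 is treated as "never oversubscribed", matching the
 * comment in the excerpt above. */
int choose_call_yield(int user_yield, size_t node_slots, size_t num_procs)
{
    if (user_yield >= 0) {
        /* An explicit user setting wins over the slot-based guess. */
        return user_yield != 0;
    }
    if (node_slots > 0 && num_procs > node_slots) {
        return 1;   /* oversubscribed: yield when idle */
    }
    return 0;       /* not oversubscribed: poll aggressively */
}
```

With this shape, choose_call_yield(1, 2, 2) would return 1 even though two processes on two slots is not oversubscribed, which is the behavior I expected from '--mca mpi_yield_when_idle 1'.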

> It may be difficult to tell if this behavior is working properly
> because, by definition, if you're in an oversubscribed situation
> (assuming that all your processes are trying to fully utilize the CPU),
> the entire system could be running pretty slowly anyway.

In my case (fortunately? unfortunately?), it's fairly obvious whether Degraded mode or Aggressive mode is being used, since one process is idle (waiting for user input) while the other one is running. Even though the node is actually oversubscribed, in Degraded mode the running process should be able to use most of the CPU, since the idle process isn't doing much.
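
To make the difference between the two modes concrete, here is a rough sketch of an idle wait loop. It is not taken from the Open MPI progress engine. The names wait_until_done, poll_fn, and countdown_done are hypothetical, and the only real API used is the POSIX sched_yield() call.

```c
#include <sched.h>
#include <stdbool.h>

/* Hypothetical poll callback: returns true once the awaited event arrived. */
typedef bool (*poll_fn)(void *ctx);

/* Example callback for demonstration: reports "done" after *ctx polls. */
bool countdown_done(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0) {
        return true;
    }
    (*remaining)--;
    return false;
}

/* Spin until is_done(ctx) is true and return the number of idle polls.
 * yield_when_idle == 0: Aggressive mode, spin without giving up the CPU.
 * yield_when_idle != 0: Degraded mode, yield the CPU after each idle poll. */
unsigned long wait_until_done(poll_fn is_done, void *ctx, int yield_when_idle)
{
    unsigned long idle_polls = 0;
    while (!is_done(ctx)) {
        idle_polls++;
        if (yield_when_idle) {
            sched_yield();  /* let other runnable processes use the CPU */
        }
    }
    return idle_polls;
}
```

In the Aggressive case, a process blocked on user input keeps spinning and competes with the calculating process for the CPU. In the Degraded case it gives up its time slice on every idle poll, which is why the calculating process should get most of the CPU.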

> I just did a small test: running 3 processes on a 2-way SMP. Each MPI
> process sends a short message around in a ring pattern 100 times:

I tried testing 4 processes on a 2-way SMP as well.
One pair of processes is waiting on STDIN.
The other pair of processes is running calculations.

First, I ran only the calculations without the STDIN processes - 35.5 second run time
Then I ran both pairs of processes, using slots=2 in my hostfile, and mpi_yield_when_idle=1 for both pairs - 25 minute run time
Then I ran both pairs of processes, using slots=1 in my hostfile - 48 second run time

Pretty drastic difference ;-)

> > I also noticed another bug in the scheduler:
> > hostfile:
> > A slots=2 max-slots=2
> > B slots=2 max-slots=2
> > 'mpirun -np 5' quits with an over-subscription error
> > 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever
> Yoinks; this is definitely a bug. I've filed a bug in our tracker to
> get this fixed. Thanks for reporting it.

(From the bug in the tracker:)
>>(I'm assuming that the "--hostfile hostfile" is implicit in the above
>>examples and was simply omitted for brevity)
Actually, I was using /usr/local/etc/openmpi-default-hostfile
So, yeah, '--hostfile /usr/local/etc/openmpi-default-hostfile' was implicit
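
For what it's worth, the behavior I expected in both cases amounts to a simple capacity check before launching anything. The sketch below is only my assumption about what the check should look like, not the actual scheduler code, and the function name fits_within_max_slots is made up for the example.

```c
#include <stddef.h>

/* Hypothetical check: can num_procs be placed on the selected nodes
 * without exceeding the sum of their max-slots values?
 * Returns 1 if the job fits, 0 if it should be rejected up front. */
int fits_within_max_slots(const size_t *max_slots, size_t num_nodes,
                          size_t num_procs)
{
    size_t capacity = 0;
    for (size_t i = 0; i < num_nodes; i++) {
        capacity += max_slots[i];
    }
    return num_procs <= capacity;
}
```

With the hostfile above, nodes A and B give a capacity of 4, so '-np 5' is correctly rejected. With '--host B' the capacity is only 2, so I would expect '-np 3' to be rejected the same way instead of hanging.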

> > And finally, on
> > - 11. How do I tell Open MPI to use processor and/or memory affinity?
> > It mentions that OpenMPI will automatically disable processor
> > affinity on oversubscribed nodes. When I first read it, I
> Correct.
> > made the assumption that processor affinity and Degraded Mode
> > were incompatible. However, it seems that independent
> > non-oversubscribed processes running in Degraded Mode work
> > fine with processor affinity - it's only actually
> > oversubscribed processes which have problems. A note that
> > Degraded Mode and Processor Affinity work together even
> > though Processor Affinity and oversubscription do not would be nice.
> Good point. I'll update the FAQ later today; thanks!
Sweet! It would probably be worth mentioning mpi_yield_when_idle=1 in there too, since it took some digging for me to find that option.
(After it's fixed, of course ;-) )

Thanks a bunch!