Open MPI Development Mailing List Archives


From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-05-26 12:01:26

> -----Original Message-----
> From: devel-bounces_at_[hidden]
> [mailto:devel-bounces_at_[hidden]] On Behalf Of Paul Donohue
> Sent: Wednesday, May 24, 2006 10:27 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Oversubscription/Scheduling Bug
> I'm using OpenMPI 1.0.2 (in case it makes a difference)
> $ mpirun -np 2 --hostfile test --host --mca
> mpi_yield_when_idle 1 --mca orte_debug 1 hostname 2>&1 | grep yield
> [psd:30325] pls:rsh: /usr/bin/ssh <template> orted
> --debug --bootproxy 1 --name <template> --num_procs 2
> --vpid_start 0 --nodename <template> --universe
> paul_at_psd:default-universe-30325 --nsreplica
> "0.0.0;tcp://" --gprreplica
> "0.0.0;tcp://" --mpi-call-yield 0
> [psd:30325] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [psd:30325] pls:rsh: executing: orted --debug --bootproxy 1
> --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
> --universe paul_at_psd:default-universe-30325
> --nsreplica "0.0.0;tcp://" --gprreplica
> "0.0.0;tcp://" --mpi-call-yield 0
> $
> When it runs the worker processes, it passes --mpi-call-yield
> 0 to the workers even though I set mpi_yield_when_idle to 1

This actually winds up in a comedy of errors. The end result is that
mpi_yield_when_idle *is* set to 1 in the MPI processes.

1. Strictly speaking, you're right that the rsh pls should probably not
be setting that variable when *not* oversubscribing. More specifically,
we should only set it to 1 when we are oversubscribing. But by point 3
(below), this is actually harmless.

2. The orted gets the option "--mpi-call-yield 0", but it actually does
the Right Thing: it only sets the MCA parameter to 1 if the argument to
--mpi-call-yield is > 0. Hence, in this case, it does not set the MCA
parameter.

3. mpirun and the orted bundle up MCA parameters from the mpirun command
line and environment and seed them in the newly-spawned processes. As
such, mpirun command line and environment MCA parameters override
anything that the orted may have set (e.g., via --mpi-call-yield). This
is actually by design.

You can see this by slightly modifying your test command -- run "env"
instead of "hostname". You'll see that the environment variable
OMPI_MCA_mpi_yield_when_idle is set to the value that you passed in on
the mpirun command line, regardless of a) whether you're oversubscribing
or not, and b) whatever is passed in through the orted.

I'm trying to think of a case where this will not be true, and I think
it's only platforms where we don't use the orted (e.g., Red Storm, where
oversubscription is not an issue).

> I tried testing 4 processes on a 2-way SMP as well.
> One pair of processes is waiting on STDIN.
> The other pair of processes is running calculations.
> First, I ran only the calculations without the STDIN
> processes - 35.5 second run time
> Then I ran both pairs of processes, using slots=2 in my
> hostfile, and mpi_yield_when_idle=1 for both pairs - 25
> minute run time
> Then I ran both pairs of processes, using slots=1 in my
> hostfile - 48 second run time

This is quite fishy. Note that the processes blocking on STDIN should
not be affected by the MPI yield setting -- the MPI yield setting is
*only* in effect when you're waiting for progress in an MPI function
(e.g., in MPI_SEND or MPI_RECV or the like). So:

- on a 2 way SMP
- if you have N processes running
- 2 of which are blocking in MPI calls
- (N-2) of which are blocking on <STDIN>

Note that Open MPI's "blocking" calls usually spin trying to make
progress. So in the above scenario, you'll have 2 MPI processes
spinning heavily and probably fully utilizing both CPUs. The other
(N-2) processes should not be a factor.

So the question is -- why does setting mpi_yield_when_idle to 1 take so
much time? I'm guessing that it's doing exactly what it's supposed to
be doing -- lots and lots of yielding (although I agree that a
difference of 48 seconds -> 25 minutes seems a bit excessive). The
constant yielding could be quite expensive. Are your 2 processes doing
a lot of very large communications with each other?

> > Good point. I'll update the FAQ later today; thanks!
> Sweet! It would probably be worth mentioning
> mpi_yield_when_idle=1 in there too - it took some digging for
> me to find that option
> (After it's fixed, of course ;-) )

Will do.

Jeff Squyres
Server Virtualization Business Unit
Cisco Systems