Sorry for the delays in replying.
The central problem is that Open MPI is much more aggressive about its
message passing progress than LAM is. That aggressiveness is how it gets
as high performance as possible; it simply wasn't designed to share the
processor well with other processes.
mpi_yield_when_idle is most helpful only for certain transports that
actively use our event engine, such as the TCP device. Since you're
using the LAM sysv RPI, I assume you're using the TCP and shared
memory devices in OMPI, right? If you're using InfiniBand, for
example, the event engine is not called much because IB has its own
progression engine that is unrelated to OMPI's (and therefore we don't
invoke OMPI's much).
mpi_yield_when_idle is also only helpful if you're going into the MPI
layer often and making message passing progress (i.e., OMPI's event
engine is actively being invoked). Is this true for your application?
If mpi_yield_when_idle really doesn't help much, you may consider
sprinkling calls to sched_yield() in your codes to force the process
to yield the processor.
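
As a rough sketch of what that could look like (not a definitive recipe):
replace a blocking wait with a loop that polls for completion and calls
sched_yield() whenever nothing is ready yet. The names wait_with_yield,
poll_fn and countdown_poll below are made up for illustration;
countdown_poll is only a stand-in so the snippet compiles without MPI. In a
real code the callback would wrap a non-blocking MPI call such as MPI_Test,
as shown in the comment.

```c
#include <sched.h>    /* sched_yield() */
#include <stdbool.h>

/* Hypothetical callback type: returns true once the awaited work is done. */
typedef bool (*poll_fn)(void *ctx);

/*
 * Poll until poll() reports completion, giving up the processor after
 * every unsuccessful poll instead of spinning. Returns how many times
 * we yielded.
 *
 * With MPI, the callback could wrap a non-blocking completion test, e.g.:
 *
 *   static bool request_done(void *ctx) {
 *       int flag = 0;
 *       MPI_Test((MPI_Request *)ctx, &flag, MPI_STATUS_IGNORE);
 *       return flag != 0;
 *   }
 */
static unsigned long wait_with_yield(poll_fn poll, void *ctx)
{
    unsigned long yields = 0;
    while (!poll(ctx)) {
        sched_yield();   /* let another runnable process have the core */
        ++yields;
    }
    return yields;
}

/* Stand-in poll for illustration only: "completes" once the counter hits 0. */
static bool countdown_poll(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0)
        return true;
    --*remaining;
    return false;
}
```

The trade-off is latency: a process that has just yielded will not notice
an incoming message until it is scheduled again, so this is probably only
worth doing in places where your ranks spend a long time waiting.
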
On Apr 4, 2008, at 2:30 AM, Lars Andersson wrote:
> I'm just in the process of moving our application from LAM/MPI to
> OpenMPI, mainly because OpenMPI makes it easier for a user to run
> multiple jobs (MPI universes) simultaneously. This is useful if a user
> wants to run smaller experiments without disturbing a large experiment
> running in the background. I've been evaluating the performance using
> a simple test, running on a heterogeneous cluster of 2 x dual core
> Opteron machines, a couple of dual core P4 Xeon machines and an 8 core
> Core2 machine. The main structure of the application is a master rank
> distributing job packages to the rest of the ranks and collecting the
> results. We don't use any fancy MPI features but rather see it as an
> efficient low-level tool for broadcasting and transferring data.
> When a single user runs a job (fully subscribed nodes, but not
> oversubscribed, i.e., one process per CPU core) on an otherwise unloaded
> cluster both LAM/MPI and OpenMPI average runtimes of about 1m33s
> (OpenMPI has a slightly lower average).
> When I start the same job simultaneously as two different users (thus
> oversubscribing the nodes 2x) under LAM/MPI, the two jobs finish in an
> average time of about 3m, thus scaling very well (we use the -ssi rpi
> sysv option to mpirun under LAM/MPI to avoid busy waiting).
> When running the same second experiment under OpenMPI, the average
> runtime jumps up to about 3m30s, with runs occasionally taking more
> than 4 minutes to complete. I do use the "--mca mpi_yield_when_idle 1"
> option to mpirun, but it doesn't seem to make any difference. I've
> also tried setting the environment variable
> OMPI_MCA_mpi_yield_when_idle=1, but still no change. ompi_info says:
> ompi_info --param all all | grep yield
> MCA mpi: parameter "mpi_yield_when_idle" (current value: "1")
> The cluster is used for various tasks, running MPI applications as
> well as non-MPI applications, so we would like to avoid spending too
> many cycles on busy-waiting. Any ideas on how to tweak OpenMPI to get
> better performance and more cooperative behavior in this case would be
> greatly appreciated.