On Mon, Dec 5, 2011 at 16:12, Ralph Castain <rhc_at_[hidden]> wrote:
> Sounds like we should be setting this value when starting the process - yes?
> If so, what is the "good" value, and how do we compute it?
I've also been looking at this for the past few days. What I came
up with is a small script, psm_shctx, which sets the envvar and then
execs the MPI binary; it is inserted between mpirun and the MPI binary:
mpirun psm_shctx my_mpi_app
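A minimal sketch of what such a wrapper could look like, written here in Python. This is not the actual psm_shctx script. The variable name PSM_SHAREDCONTEXTS_MAX is an assumption (this message only says "the envvar"), and PSM_SHCTX_CONTEXTS is a hypothetical helper variable used only to pass the desired context count to the wrapper:

```python
#!/usr/bin/env python3
"""Sketch of a wrapper placed between mpirun and the MPI binary."""
import os
import sys

# Assumption: name of the envvar under discussion; not stated in this message.
ENV_VAR = "PSM_SHAREDCONTEXTS_MAX"
# Hypothetical: variable the wrapper reads to learn how many contexts to request.
COUNT_VAR = "PSM_SHCTX_CONTEXTS"
# Each job needs at least 1 context, so fall back to that.
DEFAULT_CONTEXTS = 1


def build_env(base_env, contexts):
    """Return a copy of base_env with the context limit set."""
    env = dict(base_env)
    env[ENV_VAR] = str(contexts)
    return env


def main(argv):
    """Set the envvar, then replace this process with the MPI binary."""
    contexts = os.environ.get(COUNT_VAR, DEFAULT_CONTEXTS)
    env = build_env(os.environ, contexts)
    # execvpe searches PATH and does not return on success.
    os.execvpe(argv[1], argv[1:], env)


# With no program argument there is nothing to exec, so do nothing.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Because the wrapper execs rather than spawns, the MPI binary keeps the wrapper's PID and no extra process stays between orted and the rank.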
Of course, the same effect can be obtained if orted sets the envvar
before starting the process. There is a problem, however: deciding how
many contexts to use. For maximum performance, one should use a ratio
of 1:1 between MPI ranks and contexts; the highest ratio possible (but
with the lowest performance) is 4 MPI ranks per context; another
restriction is that each job must have at least 1 context.
For example, on AMD cluster nodes with 4 CPUs of 12 cores each (48
cores in total) one gets 16 contexts; assigning all 16 contexts to 48
ranks would mean a ratio of 1:3, but this can only apply if allocation
of cores is done in multiples of 4; with a less advantageous allocation
strategy, more contexts are lost due to rounding up. At the extreme, if
there's only one rank per job, there can be at most 16 jobs, which use
all 16 contexts, and the remaining 32 cores have to remain idle or
be used for other jobs that don't require communication over
There is a further issue though: MPI-2 dynamic creation of processes.
If it's not known how many ranks there will be, I guess one should use
the highest context sharing ratio (1:4) to be on the safe side.
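The context arithmetic above can be sketched as follows. This is only an illustration of the rounding, assuming a job needs its rank count divided by the sharing ratio, rounded up, in contexts; the function names are hypothetical and the node figures (48 cores, 16 contexts) are the ones from the example:

```python
MAX_RANKS_PER_CONTEXT = 4  # highest sharing ratio mentioned above


def contexts_needed(ranks, ranks_per_context):
    """Contexts one job needs on a node, rounding up (always at least 1)."""
    if ranks < 1:
        raise ValueError("a job has at least one rank")
    if not 1 <= ranks_per_context <= MAX_RANKS_PER_CONTEXT:
        raise ValueError("sharing ratio must be between 1 and 4")
    # Integer ceiling division.
    return (ranks + ranks_per_context - 1) // ranks_per_context


def max_jobs(total_contexts, ranks_per_job, ranks_per_context):
    """How many equally sized jobs fit before the contexts run out."""
    return total_contexts // contexts_needed(ranks_per_job, ranks_per_context)


def idle_cores(total_cores, total_contexts, ranks_per_job, ranks_per_context):
    """Cores left without a rank once contexts (or cores) are exhausted."""
    jobs = max_jobs(total_contexts, ranks_per_job, ranks_per_context)
    used = min(total_cores, jobs * ranks_per_job)
    return total_cores - used


if __name__ == "__main__":
    print(contexts_needed(48, 3))     # one 48-rank job at 3 ranks per context
    print(idle_cores(48, 16, 1, 3))   # single-rank jobs only
    print(contexts_needed(48, 4))     # "safe side" ratio for dynamic processes
```

For the node above this gives 16 contexts for a single 48-rank job at 3 ranks per context, 32 idle cores when every job has one rank, and 12 contexts at the 1:4 ratio.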
I've found a mention of this envvar being handled in the changelog for
MVAPICH2 1.4.1; maybe that can serve as a source of inspiration? (But
I haven't looked at it...)
Hope this helps,