There was some discussion at yesterday's tutorial about ORTE scalability and
where bottlenecks might be occurring. I spent some time last night
identifying key information required to answer those questions. I'll be
presenting a slide today showing the key timing points that we would need.
I have also begun (this morning) to instrument the trunk to measure those
times. Some really quick results, all done on a Mac G5:
1. It takes about 3 milliseconds to set up a job (i.e., go through the RDS,
RAS, and RMAPS frameworks, set up the stage gate triggers, prep I/O
forwarding, etc. - everything before we actually launch). This bounces
around a lot (I'm just using gettimeofday), but seems to have at most a
slight dependence on the number of processes being launched.
2. It takes roughly 1-3 milliseconds to execute the compound command that
registers all of the data from an MPI process (i.e., the data sent at the
STG1 stage gate). This is the time required on the HNP to process the
command - it doesn't include any time spent actually communicating. It does,
however, include time spent packing/unpacking buffers. My tests were all
done on a local node for now, so the OOB just passes the buffer across from
send to receive. As you would expect, since the info being stored is only
from one process, there is no observable scaling dependence here.
3. The time from the start of MPI_Init until we issue the registry command is
about 12-20 milliseconds. Again, as expected, there is no observable scaling
dependence.
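For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of gettimeofday-based timing in C. It is not the actual trunk instrumentation: the names elapsed_ms and min_time_ms are hypothetical, and taking the minimum over several repetitions is just one assumed way of coping with the run-to-run jitter mentioned in item 1.

```c
#include <stddef.h>
#include <sys/time.h>

/* Elapsed wall-clock time between two gettimeofday() samples, in
 * milliseconds. tv_usec is a signed type, so the microsecond difference
 * may be negative; the seconds term compensates for that. */
double elapsed_ms(const struct timeval *start, const struct timeval *stop)
{
    return (double)(stop->tv_sec - start->tv_sec) * 1000.0
         + (double)(stop->tv_usec - start->tv_usec) / 1000.0;
}

/* Hypothetical helper: call fn(arg) `reps` times and return the smallest
 * elapsed time observed, in milliseconds. Returns 0.0 if reps <= 0. */
double min_time_ms(void (*fn)(void *), void *arg, int reps)
{
    double best = 0.0;
    int have_sample = 0;

    for (int i = 0; i < reps; i++) {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        fn(arg);                      /* the section being timed */
        gettimeofday(&t1, NULL);

        double ms = elapsed_ms(&t0, &t1);
        if (!have_sample || ms < best) {
            best = ms;
            have_sample = 1;
        }
    }
    return best;
}
```

The section to be timed (for example, a wrapper around the job setup path) would be passed in as fn. Note that gettimeofday reports wall-clock time, so it can jump if the system clock is adjusted during a run.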
We will need to run quite a few more tests, of course, but I don't expect the
first two values to change very much (obviously, they will depend on the
hardware on the head node). I'll keep you posted as we learn more.