Hi all -
LANL had an internal meeting yesterday trying to classify a number of
issues we're having with the run-time environment for Open MPI and how
to best prioritize team resources. We thought it would be good to
both share the list (with priorities) with the group and to ask the
group if there were other issues that need to be addressed (either
short or long term). We've categorized the issues as performance
related, robustness, and feature / platform support. The numbers are
the current priority on our list, and items within a category are
sorted by priority.
PERFORMANCE:
5) 50% scale factor in process startup
Start-up of non-MPI jobs has a strange bend in the timing curve
when the number of processes we are trying to start is greater than
or equal to 50% of the current allocation. It appears that
starting a 16 process (1 ppn) job takes longer if there are 32
nodes in the allocation than if there are 64 nodes in the
allocation.
Assigned to: Galen
6) MPI_INIT startup timings
In addition to apparently suffering from the same 50% issue as #5,
there also appear to be a number of places in MPI_INIT where we
spend a considerable amount of time at scale, leading to startup
times much worse than LA-MPI or MPIEXEC/MVAPICH.
Assigned to: Galen
ROBUSTNESS:
1) MPI process aborting issue
This is the "orted spins, MPI processes don't die, etc." issue that
occurs when some process dies unexpectedly. Ralph has already sent
a detailed e-mail to devel about this issue.
Assigned to: Ralph
1.5) MPI_ABORT rework
The MPI process aborting issue is going to require a rework of
MPI_ABORT so that it uses the error manager instead of calling
terminate_proc/terminate_job.
Assigned to: Brian
2) ORTE hangs when start-up fails
If an orted fails to start or fails to connect back to the HNP, the
system hangs waiting for the callback. If an orted process fails to
start entirely, we sometimes catch this. But we need a better
mechanism for handling the general failure case.
Assigned to: Ralph
3) Hardened cleanup of session directory
While #1 should greatly help in ensuring that the session directory
is cleaned up every time, there are still a number of race
conditions that need to be sorted out. The goal is to develop a
plan that ensures files that need to be removed are removed
automatically a high percentage of the time, that there is a way to
allow a tool like orte_clean to clean up everything it should clean
up, and that there is a way to make sure files that should not be
automatically removed aren't automatically removed.
Assigned to: Brian
3.5) Process not found hangs
See https://svn.open-mpi.org/trac/ompi/ticket/245
Assigned to: Ralph
7) Node death failures / hangs
With the exception of BProc, if a node fails, we don't detect the
failure. Even if we did detect the failure, we have no general
mechanism for dealing with that failure. The bulk of this project
is going to be adding a general SOH/SMR component that uses the OOB
for timeout pings.
Assigned to: Brian
15) More friendly error messages
There are situations where we give something south of a useful
error message when an error is found. We should play nicer with
users.
Assigned to:
16) Consistent error checking
We've had a number of recent instances of errors occurring, but not
being propagated / returned to the user simply because no one ever
checked the return code. We need to audit most of ORTE to always
check return codes.
Assigned to:
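
For #16, one way to make return-code checks hard to forget is a small
macro that logs the failing call site and returns the code to the
caller. The following is only a sketch of that idea: DEMO_CHECK,
demo_step, demo_launch and the DEMO_* codes are hypothetical names
made up for illustration, not existing ORTE symbols.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes; not the real ORTE constants. */
#define DEMO_SUCCESS 0
#define DEMO_ERROR  (-1)

/* Log the failing call site and hand the code back to the caller,
 * so a failure cannot be dropped silently. */
#define DEMO_CHECK(call)                                         \
    do {                                                         \
        int demo_rc_ = (call);                                   \
        if (DEMO_SUCCESS != demo_rc_) {                          \
            fprintf(stderr, "%s:%d: %s failed with rc=%d\n",     \
                    __FILE__, __LINE__, #call, demo_rc_);        \
            return demo_rc_;                                     \
        }                                                        \
    } while (0)

/* Hypothetical stand-in for a step that can fail. */
static int demo_step(int should_fail)
{
    return should_fail ? DEMO_ERROR : DEMO_SUCCESS;
}

/* Runs nsteps steps; step number fail_at fails (use -1 for "never").
 * The first failure is propagated unchanged to the caller. */
static int demo_launch(int fail_at, int nsteps)
{
    assert(nsteps >= 0);
    for (int i = 0; i < nsteps; ++i) {
        DEMO_CHECK(demo_step(i == fail_at));
    }
    return DEMO_SUCCESS;
}
```

With something like this, an audit reduces to looking for calls that
are not wrapped, rather than reading every call site by hand.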
FEATURE / PLATFORM SUPPORT:
4) TM error handling
TM, while used on a number of large systems LANL needs to support,
is not exactly friendly to usage at scale. It seems that it likes
to go away and cry to mamma for a couple seconds, returning system
error messages, only to come back and be ok a second later. This
means that every TM call needs to be handled as if it's going to
fail, and we need to be prepared to re-initialize the system (if
possible) when failures occur. In testing on t-bird, launching was
usually pretty stable, but the calls to get the node allocations
tended to result in the strange behavior. These should definitely
be re-startable type errors.
Assigned to: Brian
8) Heterogeneous Issues
Assigned to:
9) External connections
This covers issues like those the Eclipse team is experiencing.
If, for example, a TCP connection to the seed is severed, it causes
Open RTE to call abort, which means Eclipse just aborted. That's
not so good. There are other naming / status issues that also need
to be handled here.
Assigned to:
9.5) Fix/Complete orte-ps and friends
orte-ps / orte-clean / etc. all depend on being able to make a
connection to the orte universe that doesn't result in bad things
happening. We should finish these things for obvious reasons.
Assigned to:
10) Remote connections
This is similar to #9, but includes the ability to start a remote
HNP process.
Assigned to:
11) Dynamic MPI-2 support
ORTE's support for the MPI-2 dynamics has some well-known issues.
In addition, we need to change some behaviors at the Open MPI level
to behave better.
Assigned to:
12) XCPU support
The XCPU system is a distributed process management system
implemented using the Plan 9 filesystem. An RAS (possibly) and PLS
are needed to support launching on XCPU systems.
Assigned to:
13) Multi-cell support
Assigned to:
14) Memory usage / null components
This is related to an e-mail Ralph or Jeff sent yesterday regarding
support for NULL components. The idea is to not load all the
components into memory if null is specified as the preferred
component name.
Assigned to:
15) RAS multi-component issues
If you are in an allocation (say, TM or BProc) and try to specify
--hostfile on the orterun command line, the hostfile option will be
ignored and you'll use the previous allocation. There are some
other similar cases, all of which can result in rather unexpected
behaviour from the user's point of view.
Assigned to:
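
For #4, the "treat every TM call as if it's going to fail, and
re-initialize when it does" idea could be wrapped in one retry helper.
This is a rough sketch under assumptions: the DEMO_* codes, demo_retry,
and the callback types are hypothetical and do not correspond to the
TM API or to existing ORTE code. A real version would also need to
decide which TM errors count as transient and would probably wait
between attempts, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; not real TM or ORTE values. */
enum { DEMO_OK = 0, DEMO_TRANSIENT = 1, DEMO_FATAL = 2 };

typedef int (*demo_call_fn)(void *ctx);   /* the wrapped call */
typedef int (*demo_reinit_fn)(void *ctx); /* re-initializes the system */

/* Call `call` up to max_attempts times. After each transient failure,
 * run `reinit` (if given) before trying again. Returns the last code
 * from `call`, or DEMO_FATAL if re-initialization itself fails. */
static int demo_retry(demo_call_fn call, demo_reinit_fn reinit,
                      void *ctx, int max_attempts)
{
    assert(call != NULL && max_attempts > 0);
    int rc = DEMO_FATAL;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        rc = call(ctx);
        if (DEMO_TRANSIENT != rc) {
            return rc; /* success, or a failure we should not retry */
        }
        if (reinit != NULL && DEMO_OK != reinit(ctx)) {
            return DEMO_FATAL; /* could not recover the connection */
        }
    }
    return rc; /* still transient after max_attempts tries */
}

/* Demo stand-in: fails transiently until the counter reaches zero. */
static int demo_flaky_call(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0) {
        --*remaining;
        return DEMO_TRANSIENT;
    }
    return DEMO_OK;
}

/* Demo stand-in: re-initialization always succeeds. */
static int demo_reinit(void *ctx)
{
    (void)ctx;
    return DEMO_OK;
}

/* Convenience driver: `failures` transient errors, then success. */
static int demo_run(int failures, int max_attempts)
{
    int remaining = failures;
    return demo_retry(demo_flaky_call, demo_reinit, &remaining,
                      max_attempts);
}
```

Here demo_run(2, 5) recovers after two transient failures, while
demo_run(5, 3) gives up and reports the transient error to the caller.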