On 09/03/2010 10:05 PM, Jeff Squyres wrote:
> On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote:
>> Backing off the polling rate requires more application-specific logic like that offered below, so it is a little difficult for us to implement at the MPI library level. Not saying we eventually won't - just not sure anyone quite knows how to do so in a generalized form.
> FWIW, we've *talked* about this kind of stuff among the developers -- it's at least somewhat similar to the "backoff to blocking communications instead of polling communications" issues. That work in particular has been discussed for a long time but never implemented.
> Are your jobs hanging because of deadlock (i.e., application error), or infrastructure error? If they're hanging because of deadlock, there are some PMPI-based tools that might be able to help.
These are application deadlocks (like the well-known VASP calling MPI_Finalize when
it should be calling MPI_Abort!). But I'm asking as a system manager with dozens of
apps run by dozens of users hanging and not being noticed for a day or two because
users are not attentive and, from outside the job, everything looks OK. So the problem
is detection. Are you suggesting there are PMPI approaches we could apply to every
production job on the system?
I now have a hack to opal_progress that seems to do what we want without any impact
on performance in the "good" case. It basically involves keeping count of the number
of consecutive calls to opal_progress with no events completed. When that count hits a
large number (e.g. 10^9), sleeping (maybe up to a second) once every, say, 10^3-10^4
passes through opal_progress seems to do "the right thing". (Obviously, any event
completion resets everything to spinning.) There are a few magic numbers there that
need to be overridable by users. Please let me know if this idea is blatantly flawed.
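
To make the counting logic concrete, here is a minimal standalone sketch in C. It is
not the actual opal_progress patch and does not use any Open MPI internals: the
names backoff_state, backoff_should_sleep and backoff_progress are made up for
illustration, and the three tunables stand in for the "magic numbers" mentioned above.
It assumes the caller knows how many events completed on each pass.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical state and tunables; none of these names come from Open MPI. */
typedef struct {
    uint64_t idle_threshold;  /* consecutive empty passes before backing off, e.g. 10^9 */
    uint64_t sleep_interval;  /* once backed off, sleep once every N passes, e.g. 10^3-10^4 */
    long     sleep_nsec;      /* length of each sleep in nanoseconds, at most 10^9 (1 s) */
    uint64_t idle_count;      /* current run of passes with no completed events */
    uint64_t sleeps;          /* number of sleeps taken so far (diagnostic only) */
} backoff_state;

/* Update the idle counter for one progress pass and report whether this
 * pass should sleep. Any completed event resets the counter, which puts
 * the caller back into pure spinning. */
int backoff_should_sleep(backoff_state *s, int events_completed)
{
    assert(s->sleep_interval > 0);

    if (events_completed > 0) {
        s->idle_count = 0;
        return 0;
    }

    s->idle_count++;
    if (s->idle_count < s->idle_threshold) {
        return 0;  /* still in the "good" case: keep spinning */
    }

    /* Past the threshold: sleep on the first such pass, then every
     * sleep_interval passes after that. */
    return ((s->idle_count - s->idle_threshold) % s->sleep_interval) == 0;
}

/* Intended to be called once at the end of each progress pass. */
void backoff_progress(backoff_state *s, int events_completed)
{
    if (backoff_should_sleep(s, events_completed)) {
        struct timespec ts;
        ts.tv_sec  = s->sleep_nsec / 1000000000L;
        ts.tv_nsec = s->sleep_nsec % 1000000000L;
        nanosleep(&ts, NULL);
        s->sleeps++;
    }
}
```

In the spinning case the added cost is one increment and one comparison per pass. The
sleep only happens after a long run of empty passes, so a job that is making progress
should never reach it.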