On Sep 3, 2010, at 5:10 PM, David Singleton wrote:
> On 09/03/2010 10:05 PM, Jeff Squyres wrote:
>> On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote:
>>> Backing off the polling rate requires more application-specific logic like that offered below, so it is a little difficult for us to implement at the MPI library level. Not saying we eventually won't - just not sure anyone quite knows how to do so in a generalized form.
>> FWIW, we've *talked* about this kind of stuff among the developers -- it's at least somewhat similar to the "backoff to blocking communications instead of polling communications" issues. That work in particular has been discussed for a long time but never implemented.
>> Are your jobs hanging because of deadlock (i.e., application error), or infrastructure error? If they're hanging because of deadlock, there are some PMPI-based tools that might be able to help.
> These are application deadlocks (like the well-known VASP calling MPI_Finalize when
> it should be calling MPI_Abort!). But I'm asking as a system manager with dozens of
> apps run by dozens of users hanging and not being noticed for a day or two because
> users are not attentive and, from outside the job, everything looks OK. So the problem
> is detection. Are you suggesting there are PMPI approaches we could apply to every
> production job on the system?
> I now have a hack to opal_progress that seems to do what we want without any impact
> on performance in the "good" case. It basically involves keeping count of the number
> of contiguous calls to opal_progress with no events completed. When that hits a large
> number (e.g., 10^9), sleeping (maybe up to a second) on every, say, 10^3-10^4 passes
> through opal_progress seems to do "the right thing". (Obviously, any event completion
> resets everything to spinning.) There are a few magic numbers there that need to
> be overridable by users. Please let me know if this idea is blatantly flawed.
I once implemented something like this to help with debugging. There are a few gotchas, though. Here are some off the top of my head, based on that earlier attempt:
1. Some of the MPI transports don't rely on the event library but instead poll on their own thread, so this won't detect those cases.
2. You have now introduced some overhead into the progress engine, which is in the critical path for those transports that use it, so your latencies will definitely increase. This may or may not be apparent at the application level, depending on the app. It will show up, however, in benchmarks aimed at latency.
3. The progress engine isn't running in its own thread, so "sleeping" the progress engine "sleeps" the whole process and prevents it from doing anything. If you are waiting on a non-blocking recv, for example, you have just put your process to sleep, instead of letting it do other work, because the message hasn't arrived yet.
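
For reference, the counting scheme from the quoted message can be sketched roughly as below. This is only an illustration in plain C, not the actual opal_progress code and not David's patch: the struct, its field names, and backoff_should_sleep() are all hypothetical, and the thresholds are left to the caller (the quoted message suggests something like 10^9 idle passes, then a sleep every 10^3-10^4 passes).

```c
#include <stdint.h>

/* Hypothetical state for an idle-backoff policy; none of these names
 * come from Open MPI. */
typedef struct {
    uint64_t idle_threshold; /* consecutive empty passes before backing off */
    uint64_t sleep_interval; /* once backing off, sleep every N-th pass */
    uint64_t idle_count;     /* consecutive passes with no completed events */
} backoff_t;

/* Call once per pass through the progress loop.
 * events_completed: number of events the pass completed.
 * Returns 1 if the caller should sleep on this pass, 0 to keep spinning. */
static int backoff_should_sleep(backoff_t *b, int events_completed)
{
    if (events_completed > 0) {
        /* Any completion resets everything to spinning. */
        b->idle_count = 0;
        return 0;
    }

    b->idle_count++;
    if (b->idle_count < b->idle_threshold) {
        return 0; /* still in the "good" case: pure spinning */
    }

    /* Treat an interval of 0 as 1 to avoid dividing by zero. */
    uint64_t interval = b->sleep_interval ? b->sleep_interval : 1;
    return (b->idle_count - b->idle_threshold) % interval == 0;
}
```

A caller would do something like nanosleep() when this returns 1. Note that the sketch still has the problems listed above: the counter update and branch run on every pass (gotcha 2), and the sleep happens in the calling thread, so the whole process stalls (gotcha 3).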