In the upcoming 1.5 series, we will introduce a new "sensor" framework to help resolve such issues. Among other things, it will automatically track (if requested) the size of a sentinel file, CPU usage, and memory footprint, and will terminate the job if any of them violates a user-specified limit (e.g., the file doesn't grow fast enough, or memory grows too large).
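To make the sentinel-file idea concrete, here is a rough sketch of the kind of check involved. It is not the sensor framework's code or API; the names `sentinel_size` and `looks_stalled` and the `min_growth` threshold are made up for illustration, and a real monitor would sample the file on a timer.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: size of the sentinel file in bytes,
 * or -1 if the file cannot be opened or measured. */
long sentinel_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return -1;
    }
    long size = -1;
    if (fseek(f, 0L, SEEK_END) == 0) {
        size = ftell(f);
    }
    fclose(f);
    return size;
}

/* Hypothetical policy: the job looks stalled if the file is missing
 * or grew by fewer than min_growth bytes since the previous sample. */
bool looks_stalled(long prev_size, long cur_size, long min_growth)
{
    if (cur_size < 0) {
        return true;
    }
    return (cur_size - prev_size) < min_growth;
}
```

A monitor would call `sentinel_size` once per sampling interval, compare against the previous sample with `looks_stalled`, and kill the job after some number of consecutive stalled samples.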
Backing off the polling rate requires more application-specific logic, like that offered below, so it is a little difficult for us to implement at the MPI library level. I'm not saying we never will; I'm just not sure anyone quite knows how to do it in a generalized form.
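For readers wondering what application-level backoff might look like, here is a minimal sketch. It is not Open MPI code and not the solution referenced below; `poll_fn`, `wait_with_backoff`, and `ready_after_n` are hypothetical names. The idea is to spin for a fixed number of polls, then sleep between polls with a doubling delay up to a cap. In an MPI program the predicate could wrap a nonblocking check such as `MPI_Test` or `MPI_Iprobe`.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep */

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate type: returns true once the awaited event occurred. */
typedef bool (*poll_fn)(void *ctx);

/* Sleep for the given number of microseconds. */
static void sleep_us(long us)
{
    struct timespec ts;
    ts.tv_sec = us / 1000000L;
    ts.tv_nsec = (us % 1000000L) * 1000L;
    nanosleep(&ts, NULL);
}

/*
 * Poll `ready` up to max_polls times. The first spin_iters polls run
 * back to back (pure spin-wait); after that, each failed poll is followed
 * by a sleep that starts at 1 us and doubles up to max_sleep_us.
 * Returns the number of polls made when the event was seen, or -1 if
 * max_polls was reached without success.
 */
long wait_with_backoff(poll_fn ready, void *ctx, long spin_iters,
                       long max_sleep_us, long max_polls)
{
    assert(ready != NULL);
    long sleep_len = 1;
    for (long polls = 1; polls <= max_polls; polls++) {
        if (ready(ctx)) {
            return polls;
        }
        if (polls > spin_iters) {
            sleep_us(sleep_len);
            sleep_len *= 2;
            if (sleep_len > max_sleep_us) {
                sleep_len = max_sleep_us;
            }
        }
    }
    return -1;
}

/* Example predicate for illustration: ready after a fixed number of calls. */
bool ready_after_n(void *ctx)
{
    int *remaining = ctx;
    return --(*remaining) <= 0;
}
```

The right values for `spin_iters` and `max_sleep_us` depend on how latency-sensitive the application is, which is exactly why this is hard to generalize inside the library.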
On Sep 2, 2010, at 7:46 PM, Douglas Guptill wrote:
> Hi David:
> On Fri, Sep 03, 2010 at 10:50:02AM +1000, David Singleton wrote:
>> I'm sure this has been discussed before but having watched hundreds of
>> thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
>> be keen to know why there isn't some sort of "spin-wait backoff" option.
>> For example, a way to specify spin-wait for x seconds/cycles/iterations
>> then backoff to lighter and lighter cpu usage. At least that way, hung
>> jobs would become self-evident.
>> Maybe there is already some way of doing this?
> For my solution to this, see
> Douglas Guptill voice: 902-461-9749
> Research Assistant, LSC 4640 email: douglas.guptill_at_[hidden]
> Oceanography Department fax: 902-494-3877
> Dalhousie University
> Halifax, NS, B3H 4J1, Canada