I would like to say we clean up perfectly, but.... :-(
The system does try its best. I'm a little surprised here since we usually
clean up when an application process fails. Our only known problems are when
one or more of the orteds fail, usually due to a node rebooting or failing.
We hope to plug that "hole" in the spring.
You might try updating to a later version (we are at r12890+ now). I don't
think that will totally solve the problem, but it might help.
We are working on an "orteclean" program that people can use when the system
doesn't clean up properly - it will go through and kill any remaining orteds
and clean up those session directories. Hopefully, we will have something out
in that regard in Jan.
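
Until that tool exists, here is a rough sketch of the kind of per-node
cleanup described above. It is not the planned "orteclean" program, only an
illustration under assumptions: the function names are made up, it assumes the
daemons show up in `ps` with the command name "orted", and it assumes session
directories live under /tmp with names starting "openmpi-sessions-<user>@".
Check both assumptions against your installation before trusting it. It
defaults to a dry run that only reports what it finds.

```python
import getpass
import glob
import os
import shutil
import signal
import subprocess


def find_orted_pids(ps_output):
    """Return PIDs whose command name is exactly 'orted'.

    Expects lines of the form '<pid> <command>', as printed by
    `ps -u <user> -o pid=,comm=`.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        pid, comm = fields
        if os.path.basename(comm.strip()) == "orted":
            pids.append(int(pid))
    return pids


def find_session_dirs(tmp_root, user):
    """Return session directories for `user` under `tmp_root`.

    ASSUMPTION: directories are named 'openmpi-sessions-<user>@...';
    adjust the pattern if your version uses a different layout.
    """
    pattern = os.path.join(
        glob.escape(tmp_root),
        "openmpi-sessions-" + glob.escape(user) + "@*",
    )
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def clean_node(tmp_root="/tmp", dry_run=True):
    """Report (and optionally remove) stray orteds and session directories
    belonging to the current user on this node."""
    user = getpass.getuser()
    ps_output = subprocess.run(
        ["ps", "-u", user, "-o", "pid=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = find_orted_pids(ps_output)
    dirs = find_session_dirs(tmp_root, user)
    for pid in pids:
        print("stray orted:", pid)
        if not dry_run:
            os.kill(pid, signal.SIGTERM)
    for path in dirs:
        print("session directory:", path)
        if not dry_run:
            shutil.rmtree(path, ignore_errors=True)
    return pids, dirs
```

Calling clean_node() on a node lists what it would touch; clean_node(dry_run=False)
actually sends SIGTERM and removes the directories. Only run it when none of
your jobs are active on that node, since it cannot tell a stray orted from a
live one.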
In the meantime, I'll take a look again at the scenario you described and see what
I can do.
Thanks for your patience!
On 12/19/06 2:53 AM, "Lydia Heck" <lydia.heck_at_[hidden]> wrote:
> A job which crashes with a floating point underflow (or any IEEE floating
> exception) fails to clean up after itself using
> openmpi-1.3a1r12695 ..
> Nodes with copies of slaves are sitting there ...
> I also noticed that orted are left behind on other crashed jobs ..
> Should I have to expect this?
> Dr E L Heck
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
> DURHAM, DH1 3LE
> United Kingdom
> e-mail: lydia.heck_at_[hidden]
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645