Open MPI User's Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2006-12-19 08:36:01


Hi Lydia

I would like to say we clean up perfectly, but.... :-(

The system does try its best. I'm a little surprised here since we usually
clean up when an application process fails. Our only known problems are when
one or more of the orteds fail, usually due to a node rebooting or failing.
We hope to plug that "hole" in the spring.

You might try updating to a later version (we are at r12890+ now). I don't
think that will totally solve the problem, but it might help.

We are working on an "orteclean" program that people can use when the system
doesn't clean up properly - it will go through and kill any remaining orteds
and clean up those session directories. Hopefully, we will have something out
in that regard in Jan.
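
To give an idea of what that would do, here is a rough sketch in C of that
kind of manual cleanup. It assumes the leftover session directories live
under /tmp and are named "openmpi-sessions-*", which depends on your
installation; it is only an illustration, not the actual orteclean tool, and
it does not kill the stray orteds themselves (something like
"pkill -u $USER orted" would still be needed for that):

#define _XOPEN_SOURCE 500
#include <dirent.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* nftw() callback: with FTW_DEPTH, entries are visited children-first,
 * so remove() works for files and (by then empty) directories alike. */
static int rm_entry(const char *path, const struct stat *sb,
                    int type, struct FTW *ftwbuf)
{
    (void)sb; (void)type; (void)ftwbuf;
    return remove(path);
}

int main(void)
{
    const char *tmp = "/tmp";          /* assumed session directory root */
    DIR *d = opendir(tmp);
    if (d == NULL) { perror("opendir"); return 1; }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        /* assumed naming convention for leftover session directories */
        if (strncmp(e->d_name, "openmpi-sessions-", 17) == 0) {
            char path[4096];
            snprintf(path, sizeof(path), "%s/%s", tmp, e->d_name);
            printf("removing %s\n", path);
            if (nftw(path, rm_entry, 16, FTW_DEPTH | FTW_PHYS) != 0)
                perror(path);
        }
    }
    closedir(d);
    return 0;
}

Run it as the user who owned the crashed jobs; normal file permissions limit
what it can remove.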

Meantime, I'll take a look again at the scenario you described and see what
I can do.

Thanks for your patience!
Ralph

On 12/19/06 2:53 AM, "Lydia Heck" <lydia.heck_at_[hidden]> wrote:

>
> A job which crashes with a floating point underflow (or any IEEE floating
> point exception) fails to clean up after itself using
>
> openmpi-1.3a1r12695 ..
>
> Copies of the slave processes are left sitting on the nodes ...
>
> I also noticed that orted processes are left behind by other crashed jobs ..
>
> Should I expect this?
>
> Lydia
>
> ------------------------------------------
> Dr E L Heck
>
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
>
> DURHAM, DH1 3LE
> United Kingdom
>
> e-mail: lydia.heck_at_[hidden]
>
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
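
For reference, a minimal reproducer for the kind of failure described in the
report might look like the following. This is a hypothetical sketch, not the
reporter's actual code: it enables the IEEE divide-by-zero trap on one rank
via feenableexcept(), a glibc extension (other platforms would enable the
trap differently), so that rank dies on SIGFPE while the remaining ranks are
still inside MPI - exactly the situation where leftover orteds and slave
processes would show up.

#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Turn the IEEE divide-by-zero exception into a SIGFPE trap. */
        feenableexcept(FE_DIVBYZERO);
        volatile double zero = 0.0;
        volatile double x = 1.0 / zero;   /* rank 0 dies on SIGFPE here */
        printf("never reached: %f\n", x);
    }

    /* The surviving ranks block here; whether they and their orteds get
     * cleaned up after rank 0 dies is the behaviour discussed above. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Build with mpicc and launch a few ranks across nodes with mpirun; rank 0
aborts almost immediately, leaving the other ranks waiting in the barrier.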