
Open MPI Development Mailing List Archives


This web mail archive is frozen.

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.


Subject: Re: [OMPI devel] Orte cleanup
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-03-06 10:36:54

I believe I have at least helped reduce this with r17761. I added the
ability for procs to detect that their "lifeline" connection (the HNP
when using the unity routed component, or their local daemon when using
tree) has been lost, and to abort gracefully.
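The lifeline idea can be illustrated with a minimal sketch. This is not the code from r17761; it assumes the lifeline is an ordinary stream socket to the parent (HNP or local daemon), and the names `lifeline_lost()` and `abort_gracefully()` are hypothetical. The process polls the socket without blocking and treats a hang-up or an orderly EOF from the peer as a lost lifeline, then exits instead of lingering as an orphan.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the peer at the other end of the
 * lifeline socket has gone away (hang-up, error, or EOF), 0 otherwise. */
int lifeline_lost(int fd)
{
    assert(fd >= 0);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, 0);               /* timeout 0: never blocks */
    if (rc < 0)
        return errno == EINTR ? 0 : 1;       /* treat poll failures as lost */
    if (rc == 0)
        return 0;                            /* nothing pending: still up */
    if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
        return 1;                            /* peer hung up or fd is bad */
    if (pfd.revents & POLLIN) {
        char byte;
        /* Peek so that a real pending message is not consumed here. */
        ssize_t n = recv(fd, &byte, 1, MSG_PEEK);
        if (n == 0)
            return 1;                        /* orderly shutdown by peer */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return 1;                        /* hard error on the socket */
    }
    return 0;
}

/* Hypothetical graceful abort: report, release local resources, and exit
 * with a nonzero status rather than staying around as a zombie. */
void abort_gracefully(void)
{
    fprintf(stderr, "lifeline lost: aborting\n");
    /* ...local cleanup (temp files, shared-memory segments) goes here... */
    exit(1);
}
```

In use, a process would check `lifeline_lost()` periodically or from its event loop and call `abort_gracefully()` once it returns 1. A non-blocking check keeps the monitoring cheap; a real runtime would more likely hook the socket into its existing progress engine than poll it separately.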

Let me know if that helps.

On 3/4/08 9:37 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:

> I noticed that the new release of ORTE is not as good as it used to be
> at cleaning up the mess left by crashed/aborted MPI processes. Recently we
> have been experiencing a lot of zombie or livelocked processes
> running on the cluster nodes and disturbing subsequent experiments. I
> haven't really had time to investigate the issue; maybe Ralph can open a
> ticket if he is able to reproduce this.
> Aurelien
> --
> * Dr. Aurélien Bouteiller
> * Sr. Research Associate at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 350
> * Knoxville, TN 37996
> * 865 974 6321