
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Orte cleanup
From: Aurélien Bouteiller (bouteill_at_[hidden])
Date: 2008-03-05 09:39:13


Scenario 2 is definitely one of those we have been experiencing (we
are making some changes to orte, and this led some orteds to crash). I
will try to find a way to easily reproduce the other one, where
aborted MPI processes are left behind (but no orted).

Thanks,
Aurelien
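
For background on how such leftover processes arise: if a launcher
forks application processes but never reaps them, exited children
linger as zombies in the process table. Below is a minimal sketch of
the standard remedy, assuming a hypothetical launcher loop; it is
illustrative only, not ORTE's actual cleanup path.

    /* Hedged sketch: reap exited children from a SIGCHLD handler so
     * none is left as a zombie. Not ORTE's actual cleanup code. */
    #include <signal.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void reap_children(int sig)
    {
        (void)sig;
        /* Drain every exited child; WNOHANG keeps the handler from
         * blocking if a child is still running. */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = reap_children;
        sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
        sigaction(SIGCHLD, &sa, NULL);

        /* launcher work here: fork/exec application procs, event loop */
        pause();
        return 0;
    }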

On 5 Mar 08, at 08:43, Ralph H Castain wrote:

> Awesome. I haven't been seeing this behavior, but I won't swear that
> it is anywhere near fully tested.
>
> A couple of possibilities come to mind:
>
> 1. Are you building threaded? If so, then all bets are off. The new
> release of orte depends heavily on libevent, and as George pointed
> out on the Tuesday telecon, libevent is definitely not thread safe.
> So if you are building threaded, you can just about guarantee a
> problem will occur, especially if something crashes. (A quick check
> of the granted thread level is sketched after this list.)
>
> 2. Are the orteds crashing? If so, and you are using the tree routed
> module (which is the default), then application procs will be blocked
> from finalizing, since they will not be able to complete the barrier
> in MPI_Finalize. That barrier relies on the RML to communicate
> between each process and the rank=0 process. In the tree routed
> module, all RML communication is done through the local daemon; if
> that daemon dies during the job, then comm is broken. There is
> currently no recovery mechanism, nor does the OOB sense that the
> daemon socket is gone and abort the proc. We probably need to develop
> at least a method for doing the latter so that things don't just hang
> (a generic sketch of such detection also follows this list).
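
On point 1: a quick way to confirm what threading support the library
actually grants is to request it and inspect the result. A minimal,
self-contained check using only the standard MPI API (nothing
ORTE-specific is assumed):

    /* Minimal sketch: request MPI_THREAD_MULTIPLE and inspect what
     * the library actually grants. If "provided" comes back lower,
     * running multi-threaded is unsafe regardless of build flags. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            printf("granted thread level %d: MPI_THREAD_MULTIPLE "
                   "unavailable\n", provided);
        }
        MPI_Finalize();
        return 0;
    }

On point 2: the missing detection could, in principle, follow the
generic pattern below: watch the daemon connection and abort when the
peer goes away. This is an illustrative sketch only, not ORTE's OOB
code; "daemon_fd" is a hypothetical descriptor for the connection to
the local daemon.

    /* Illustrative sketch only, not ORTE's OOB code. Watch the socket
     * to the local daemon; an orderly close (recv() returning 0) or
     * POLLHUP/POLLERR means the daemon is gone, so abort rather than
     * hang in the MPI_Finalize barrier. */
    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    void watch_daemon(int daemon_fd)
    {
        struct pollfd pfd;
        pfd.fd = daemon_fd;
        pfd.events = POLLIN;

        for (;;) {
            if (poll(&pfd, 1, -1) < 0) {
                perror("poll");
                abort();
            }
            if (pfd.revents & (POLLHUP | POLLERR)) {
                fprintf(stderr, "daemon connection lost, aborting\n");
                abort();
            }
            if (pfd.revents & POLLIN) {
                char buf[4096];
                ssize_t n = recv(daemon_fd, buf, sizeof(buf), 0);
                if (n <= 0) {   /* 0: daemon closed; <0: socket error */
                    fprintf(stderr, "daemon socket dead, aborting\n");
                    abort();
                }
                /* otherwise hand the n bytes to the message path */
            }
        }
    }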
>
> That is all I can think of immediately. If you can tell me more
> about the scenario, I can try to look at it.
>
> Thanks
> Ralph
>
>
>
> On 3/4/08 9:37 PM, "Aurélien Bouteiller" <bouteill_at_[hidden]>
> wrote:
>
>> I noticed that the new release of orte is not as good as it used to
>> be at cleaning up the mess left by crashed/aborted MPI processes.
>> Recently we have been experiencing a lot of zombie or livelocked
>> processes running on the cluster nodes, disturbing subsequent
>> experiments. I haven't really had time to investigate the issue;
>> maybe Ralph can file a ticket if he is able to reproduce this.
>>
>> Aurelien
>> --
>> * Dr. Aurélien Bouteiller
>> * Sr. Research Associate at Innovative Computing Laboratory
>> * University of Tennessee
>> * 1122 Volunteer Boulevard, suite 350
>> * Knoxville, TN 37996
>> * 865 974 6321