Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-01 09:34:46


There is indeed a heartbeat mechanism you can use - it is "off" by
default. You can set it to check every N seconds with:

-mca orte_heartbeat_rate N

on your command line. Or if you want it to always run, add
"orte_heartbeat_rate = N" to your default MCA param file. OMPI will
declare the orted "dead" if two consecutive heartbeats are not seen.
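
To make the detection rule concrete, here is a minimal Python sketch of the "two consecutive missed heartbeats" logic described above. It is only an illustration and not Open MPI's actual implementation: the class name HeartbeatMonitor and its methods are hypothetical, and the real timing inside ORTE may differ.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Toy model (hypothetical, not Open MPI code): a daemon is declared
    dead once two consecutive checks pass without a heartbeat."""

    rate: float                    # seconds between checks; the N in orte_heartbeat_rate N
    max_missed: int = 2            # consecutive misses before declaring the daemon dead
    missed: int = 0                # consecutive misses counted so far
    seen_since_check: bool = False

    def beat(self) -> None:
        """Record that a heartbeat arrived from the daemon."""
        self.seen_since_check = True

    def check(self) -> bool:
        """Call once every `rate` seconds; return True if the daemon is now considered dead."""
        if self.seen_since_check:
            self.missed = 0        # any heartbeat resets the miss counter
        else:
            self.missed += 1
        self.seen_since_check = False
        return self.missed >= self.max_missed


monitor = HeartbeatMonitor(rate=5.0)
monitor.beat()
alive_check = monitor.check()      # heartbeat seen in this interval
first_miss = monitor.check()       # one missed heartbeat: still alive
second_miss = monitor.check()      # second consecutive miss: declared dead
```

In this model a single late heartbeat is tolerated, which is why one missed interval does not abort the job; a smaller N shortens detection time at the cost of more control traffic.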

Let me know how it works for you - it hasn't been extensively tested,
but has worked so far.
Ralph

On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote:

> I mean that I killed the orted daemon process while the MPI job was running,
> but the MPI job hung and did not notice that one of its ranks had failed.
>
>
>
>
> > Date: Wed, 1 Apr 2009 19:09:34 +0800
> > From: ml.jgmbenoit_at_[hidden]
> > To: users_at_[hidden]
> > Subject: Re: [OMPI users] Beginner's question: how to avoid a
> running mpi job hang if host or network failed or orted deamon killed?
> >
> > Is there a firewall somewhere ?
> >
> > Jerome
> >
> > Guanyinzhu wrote:
> > > Hi!
> > > I'm using Open MPI 1.3 on ten nodes connected with Gigabit Ethernet on
> > > Red Hat Linux x86_64.
> > >
> > > I ran a test like this: I killed the orted process, and the job hung
> > > for a long time (it hung for 2~3 hours, then I killed the job).
> > >
> > > I have the following questions:
> > >
> > > When the network or a host fails, or the orted daemon is killed by
> > > accident, how long does it take the running MPI job to notice and exit?
> > >
> > > Does Open MPI support a heartbeat mechanism, or how else could I quickly
> > > detect the failure to keep the MPI job from hanging?
> > >
> > >
> > > thanks a lot!
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users