Open MPI User's Mailing List Archives

Subject: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?
From: Guanyinzhu (buptzhugy_at_[hidden])
Date: 2009-04-01 05:27:49

I'm using Open MPI 1.3 on ten nodes connected by Gigabit Ethernet, running Red Hat Linux x86_64.


I ran a test in which I killed the orted process, and the job hung for a long time (it hung for 2 to 3 hours, and then I killed the job).


I have the following questions:


     When the network fails, a host fails, or the orted daemon is killed by accident, how long does it take for the running MPI job to notice and exit?


     Does Open MPI support a heartbeat mechanism, or is there another way to detect the failure quickly so that the MPI job does not hang?



Thanks a lot!