There is indeed a heartbeat mechanism you can use; it is off by default. You can set it to check every N seconds with:
-mca orte_heartbeat_rate N
on your command line. Or, if you want it to always run, add "orte_heartbeat_rate = N" to your default MCA param file. OMPI will declare the orted "dead" if it misses two consecutive heartbeats.
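
To make the "two consecutive heartbeats" rule concrete, here is a toy Python model of it. This is only a sketch of the rule as described above, not the actual ORTE code: the function name, the `max_missed` parameter, and the idea of one True/False entry per N-second check interval are all assumptions made for illustration.

```python
def orted_declared_dead(heartbeat_seen, max_missed=2):
    """Toy model (hypothetical, not ORTE's implementation).

    heartbeat_seen: one boolean per check interval, True if a heartbeat
    arrived from the orted during that interval.
    Returns the index of the check at which the orted would be declared
    dead, or None if it never misses max_missed checks in a row.
    """
    missed = 0
    for i, seen in enumerate(heartbeat_seen):
        if seen:
            missed = 0          # any heartbeat resets the count
        else:
            missed += 1
            if missed >= max_missed:
                return i        # second consecutive miss: declare dead
    return None


# A single missed heartbeat is tolerated; two in a row is not.
print(orted_declared_dead([True, False, True]))                       # None
print(orted_declared_dead([True, False, True, False, False, True]))  # 4
```

Under this model, a failed orted would be noticed roughly 2*N seconds after its last heartbeat, so a smaller N means faster detection at the cost of more heartbeat traffic.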
Let me know how it works for you - it hasn't been extensively tested, but has worked so far.
On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote: