Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to make a job abort when one host dies?
From: Scott Atchley (atchley_at_[hidden])
Date: 2009-08-18 11:35:16

On Aug 18, 2009, at 10:59 AM, Oskar Enoksson wrote:

>> The question is, however, why is cl120 not acking messages? What
>> is the application? What MPI calls does this application use?
>> Scott
> The reason in this case was that cl120 had some kind of hardware
> problem, perhaps a memory error or a Myrinet NIC hardware error. The
> system hung.
> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
> But still I'm curious, is there no way to set some kind of timeout
> on the waiting hosts? E.g. one hour?

There is a send timeout in MX. There is no receive timeout in MPI or MX.

The application could add pending receives with a timestamp to a
pending queue and walk the queue periodically. If it finds a receive
that has exceeded the application's threshold, it could call