Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How to make a job abort when one host dies?
From: Oskar Enoksson (enok_at_[hidden])
Date: 2009-08-18 10:59:25


Scott Atchley <atchley_at_[hidden]> wrote:

> Long answer:
>
> The messages below indicate that these processes were all trying to
> send to cl120. It did not ack their messages after 1000 resend
> attempts (one retry every 0.5 seconds), which is 500 seconds, or
> about 8.3 minutes.
>
> The messages also indicate that the message was a send_small, which
> means it was 128 bytes or less. MX has MPI-like semantics and allows
> a send to complete after the message has been either buffered or
> delivered. In this case, it was buffered and OMPI was most likely able
> to complete it successfully. The message could not be delivered,
> however, and its timeout caused MX to fail all future sends to that
> host. On the next mx_isend(), OMPI will detect a failure.
>
> Since OMPI has not detected a failure, my guess is that the processes
> have not tried to send to that host again. They then end up waiting
> forever.
>
> They can change MX's behavior so that it does not complete a send
> until the receiver has acked it by exporting:
>
> MX_ZOMBIE_SEND=0
>
> This will hurt benchmark performance, but real application performance
> should not be affected.
>
> The question is, however, why is cl120 not acking messages? What is
> the application? What MPI calls does this application use?
>
> Scott
>
The reason in this case was that cl120 had some kind of hardware
problem, perhaps a memory error or a Myrinet NIC hardware error. The
system hung.

I will try MX_ZOMBIE_SEND=0, thanks for the hint!
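
For reference, a minimal sketch of how the variable might be set for a
run. It assumes a POSIX shell and Open MPI's mpirun, whose -x option
exports an environment variable to the launched processes. The process
count and the binary name ./my_app are placeholders, not taken from this
thread, so the launch line is left commented out.

```shell
# Export the setting so that child processes of this shell inherit it.
export MX_ZOMBIE_SEND=0

# Hypothetical launch line; -x forwards the variable to all ranks.
# The process count and ./my_app are placeholders.
#   mpirun -x MX_ZOMBIE_SEND -np 16 ./my_app

# Quick local check that a child process actually sees the value.
sh -c 'echo "MX_ZOMBIE_SEND=$MX_ZOMBIE_SEND"'
```

Passing the variable with -x matters when the ranks run on remote
nodes, since a plain export in the launching shell is not guaranteed to
reach them.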

But I'm still curious: is there no way to set some kind of timeout on
the waiting hosts, e.g. one hour?