Scott Atchley <atchley_at_[hidden]> wrote:
> Long answer:
> The messages below indicate that these processes were all trying to
> send to cl120. It did not ack their messages after 1000 resend
> attempts (each retry is attempted with a 0.5 second interval) which is
> about 8.3 minutes (500 seconds).
> The messages also indicate that the message was a send_small which
> means it was 128 bytes or less. MX has MPI-like semantics and allows
> for completion after the message has been either buffered or
> delivered. In this case, it was buffered and OMPI was most likely able
> to complete it successfully. The message was not able to be delivered,
> however, and its timeout caused MX to fail all future sends to that
> host. On the next mx_isend(), OMPI will detect a failure.
> Since it does not detect failure, my guess is that the process has not
> tried to send again to that host. They then end up waiting forever.
> They can change MX's behavior so that it does not complete a send
> until the receiver has acked it by exporting:
> MX_ZOMBIE_SEND=0
> This will hurt benchmark performance, but real application performance
> should not be affected.
> The question is, however, why is cl120 not acking messages? What is
> the application? What MPI calls does this application use?
The reason in this case was that cl120 had some kind of hardware
problem, perhaps a memory error or a Myrinet NIC hardware error. The system
I will try MX_ZOMBIE_SEND=0, thanks for the hint!
But I'm still curious: is there no way to set some kind of timeout
on the waiting hosts, e.g. one hour?
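
To make the question concrete, here is a rough sketch of the kind of
limit I mean, done outside of MPI as a wrapper around the job launch.
It is only a workaround, not an MX or Open MPI feature. The names
run_with_time_limit and main are made up for illustration, and I am
assuming that killing the launcher is enough to tear down the job. The
variable is only set in the launcher's environment here; whether it
reaches the remote ranks depends on the launcher (with Open MPI's
mpirun one would probably forward it with -x).

```python
import os
import subprocess
import sys


def run_with_time_limit(cmd, limit_seconds, extra_env=None):
    """Run cmd; return its exit code, or None if the time limit was hit."""
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        # subprocess.run() kills the direct child when the timeout expires.
        # Assumption: the launcher then cleans up its remote processes.
        result = subprocess.run(cmd, env=env, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode


def main(argv):
    """argv[0] is the limit in seconds, the rest is the command to run."""
    limit = float(argv[0])
    # MX_ZOMBIE_SEND=0 as suggested in the quoted mail above.
    code = run_with_time_limit(argv[1:], limit, {"MX_ZOMBIE_SEND": "0"})
    if code is None:
        print("time limit exceeded, job killed", file=sys.stderr)
        return 1
    return code
```

For a one hour limit this would be called as something like
main(["3600", "mpirun", "-np", "16", "./my_app"]), where the mpirun
arguments are placeholders.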