Patrick Geoffray <patrick_at_[hidden]> wrote:
> Hi Oskar,
> Oskar Enoksson wrote:
>> The reason in this case was that cl120 had some kind of hardware
>> problem, perhaps memory error or myrinet NIC hardware error. The system
>> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
> I would not recommend using that setting. It will hurt performance,
> use a less-tested code path, and not really address the problem.
> Because small messages are buffered in MX, a send can return immediately:
> the send buffer can be reused right away. However, if the MX lib fails to
> reliably deliver the message, it will eventually call the asynchronous
> error handler to report the problem. The default async error handler
> only prints a message, leaving the choice of recovery to the application.
> The right way to address the problem would be for OpenMPI to
> register its own asynchronous error handler in the MX BTL/MTL, and
> signal to ORTE to terminate the job when a send timeout has occurred.
> We will implement this mechanism and push it on the trunk shortly.
Sounds great, I'm looking forward to it. Thanks a lot.
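
For illustration, here is a rough sketch of the mechanism Patrick describes: a buffered send returns at once, the delivery failure surfaces later through an asynchronous error handler, and a custom handler turns that report into a job abort. This is a toy simulation only. ToyTransport, set_async_error_handler, progress and Job.abort are hypothetical names made up for the example; they are not the MX API or the Open MPI/ORTE interfaces, and the real implementation may look quite different.

```python
from typing import Callable, List, Set, Tuple

# Signature of an asynchronous error handler: (peer, reason) -> None.
ErrorHandler = Callable[[str, str], None]


def default_handler(peer: str, reason: str) -> None:
    """Mimics the described default: only print a message, no recovery."""
    print(f"async error: {reason} to {peer}")


class ToyTransport:
    """Hypothetical stand-in for a library that buffers small sends."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, bytes]] = []
        self._handler: ErrorHandler = default_handler

    def set_async_error_handler(self, handler: ErrorHandler) -> ErrorHandler:
        """Install a new handler and return the previous one."""
        old, self._handler = self._handler, handler
        return old

    def send(self, peer: str, payload: bytearray) -> None:
        """Copy the payload and return immediately (buffered send)."""
        self._pending.append((peer, bytes(payload)))

    def progress(self, unreachable: Set[str]) -> None:
        """Try to deliver pending messages; report failures asynchronously."""
        pending, self._pending = self._pending, []
        for peer, _data in pending:
            if peer in unreachable:
                self._handler(peer, "send timeout")


class Job:
    """Hypothetical stand-in for the runtime that can terminate the job."""

    def __init__(self) -> None:
        self.aborted = False
        self.reason = ""

    def abort(self, reason: str) -> None:
        self.aborted = True
        self.reason = reason


def run_demo() -> Tuple[bool, bool, str]:
    transport, job = ToyTransport(), Job()
    buf = bytearray(b"hello")

    # The send returns before delivery, so the buffer can be reused right away.
    transport.send("cl120", buf)
    buf[:] = b"xxxxx"

    # With the default handler the failure is only printed; the job keeps going.
    transport.progress(unreachable={"cl120"})
    aborted_with_default = job.aborted

    # With a custom handler the same failure terminates the job.
    transport.set_async_error_handler(
        lambda peer, reason: job.abort(f"{reason} to {peer}")
    )
    transport.send("cl120", buf)
    transport.progress(unreachable={"cl120"})
    return aborted_with_default, job.aborted, job.reason


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is that the send call itself never sees the failure, so the only place to react is the handler.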