Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] users Digest, Vol 1321, Issue 6
From: Oskar Enoksson (enok_at_[hidden])
Date: 2009-08-18 17:39:04


Patrick Geoffray <patrick_at_[hidden]> wrote:
> Hi Oskar,
>
> Oskar Enoksson wrote:
>> The reason in this case was that cl120 had some kind of hardware
>> problem, perhaps a memory error or a Myrinet NIC hardware error. The
>> system hung.
>>
>> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
>
> I would not recommend using that setting. It will affect performance,
> use a code path that is less tested, and not really address the problem.
>
> As small messages are buffered in MX, a send can return immediately as
> the send buffer can be reused right away. However, if the MX lib fails to
> reliably deliver the message, it will eventually call the asynchronous
> error handler to report the problem. The default async error handler
> only prints a message, leaving the choice of recovery to the
> application. The right way to address the problem would be for Open MPI
> to register its own asynchronous error handler in the MX BTL/MTL, and
> signal to ORTE to terminate the job when a send timeout has occurred.
>
> We will implement this mechanism and push it to the trunk shortly.
>
> Thanks

Sounds great, I'm looking forward to it. Thanks a lot.
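
For anyone reading this in the archive later, here is how I picture the mechanism Patrick describes. It is only a toy model in Python, not MX or Open MPI code: ToyEndpoint, ToyJob, set_error_handler and progress are names I made up for illustration, and the real handler registration in the MX BTL/MTL will certainly look different. The point is just that a buffered send reports success immediately, the failure shows up later through the asynchronous error handler, and the default handler only reports it unless someone registers a handler that ends the job.

```python
# Toy model only: none of these names are real MX or Open MPI APIs.

class ToyEndpoint:
    """Models a library that buffers small sends and reports failures later."""

    def __init__(self):
        self._pending = []                     # buffered, not yet delivered
        self._handler = self._default_handler  # async error handler in use
        self.log = []                          # what the default handler "prints"

    def _default_handler(self, peer, reason):
        # Default behaviour: report the problem and leave recovery to the caller.
        self.log.append(f"async error: {reason} (peer {peer})")

    def set_error_handler(self, handler):
        # Hypothetical registration hook, standing in for the real mechanism.
        self._handler = handler

    def send(self, peer, payload):
        # The payload is copied into an internal buffer, so the caller can
        # reuse its own buffer and the call returns right away.
        self._pending.append((peer, bytes(payload)))
        return True

    def progress(self, dead_peers):
        # Delivery is attempted later; failures surface through the handler,
        # long after send() has already returned success.
        pending, self._pending = self._pending, []
        for peer, _payload in pending:
            if peer in dead_peers:
                self._handler(peer, "send timeout")


class ToyJob:
    """Stands in for the runtime that can terminate the whole job."""

    def __init__(self):
        self.aborted = False
        self.reason = None

    def abort(self, peer, reason):
        self.aborted = True
        self.reason = f"{reason} (peer {peer})"


if __name__ == "__main__":
    ep, job = ToyEndpoint(), ToyJob()

    ep.send("cl120", b"hello")   # returns immediately even though cl120 is hung
    ep.progress({"cl120"})       # default handler: only a message is recorded
    print(ep.log, job.aborted)

    ep.set_error_handler(job.abort)
    ep.send("cl120", b"hello again")
    ep.progress({"cl120"})       # custom handler: the job is marked aborted
    print(job.aborted, job.reason)
```

With the default handler the application carries on and can hang waiting for a node that will never answer, which is what I saw with cl120. With a handler that aborts the job, the failure stops everything instead.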