On Aug 18, 2009, at 10:59 AM, Oskar Enoksson wrote:
>> The question is, however, why is cl120 not acking messages? What
>> is the application? What MPI calls does this application use?
>>
>> Scott
>
> The reason in this case was that cl120 had some kind of hardware
> problem, perhaps a memory error or a Myrinet NIC hardware error. The
> system hung.
>
> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
>
> But still I'm curious, is there no way to set some kind of timeout
> on the waiting hosts? E.g. one hour?
MX has a send timeout, but neither MPI nor MX has a receive timeout.
The application could implement one itself: add each pending receive,
with a timestamp, to a pending queue and walk the queue periodically.
If it finds a receive that has exceeded the application's threshold,
it could call MPI_Cancel() on that request.
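
A minimal sketch of that pending-queue idea in C follows. The names
(pending_queue, pq_add, pq_sweep, demo_cancel, the USE_MPI guard) and
the fixed-size table are illustrative assumptions, not part of MPI or
MX. The queue itself is MPI-agnostic and takes a cancel callback. The
MPI callback under USE_MPI uses only standard calls and completes the
request with MPI_Wait() after MPI_Cancel(), since a cancelled request
still has to be completed.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MAX_PENDING 64   /* hypothetical limit on outstanding receives */

/* One tracked receive: an opaque request handle plus its post time. */
typedef struct {
    void  *request;  /* e.g. an MPI_Request* in a real MPI program */
    time_t posted;   /* when the receive was posted */
    int    active;   /* 1 while the receive is still outstanding */
} pending_recv;

typedef struct {
    pending_recv slots[MAX_PENDING];
    double       timeout_sec;             /* application-chosen threshold */
    void       (*cancel)(void *request);  /* invoked for each expired entry */
} pending_queue;

void pq_init(pending_queue *q, double timeout_sec, void (*cancel)(void *)) {
    for (size_t i = 0; i < MAX_PENDING; i++)
        q->slots[i].active = 0;
    q->timeout_sec = timeout_sec;
    q->cancel = cancel;
}

/* Record a newly posted receive. Returns its slot, or -1 if the table is full. */
int pq_add(pending_queue *q, void *request, time_t now) {
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!q->slots[i].active) {
            q->slots[i].request = request;
            q->slots[i].posted = now;
            q->slots[i].active = 1;
            return i;
        }
    }
    return -1;
}

/* Call when a receive completes normally so it is no longer tracked. */
void pq_complete(pending_queue *q, int slot) {
    assert(slot >= 0 && slot < MAX_PENDING);
    q->slots[slot].active = 0;
}

/* Walk the queue and cancel every receive that has been pending longer
 * than the threshold. Returns the number of receives cancelled. */
int pq_sweep(pending_queue *q, time_t now) {
    int cancelled = 0;
    for (int i = 0; i < MAX_PENDING; i++) {
        pending_recv *p = &q->slots[i];
        if (p->active && difftime(now, p->posted) > q->timeout_sec) {
            q->cancel(p->request);
            p->active = 0;
            cancelled++;
        }
    }
    return cancelled;
}

/* Stand-in callback for trying the queue without MPI: counts calls. */
void demo_cancel(void *request) {
    ++*(int *)request;
}

#ifdef USE_MPI
#include <mpi.h>

/* Callback for real MPI receives: cancel, then complete the request. */
void cancel_mpi_recv(void *request) {
    MPI_Request *req = request;
    MPI_Status status;
    int was_cancelled = 0;

    MPI_Cancel(req);
    MPI_Wait(req, &status);  /* returns once cancelled or matched */
    MPI_Test_cancelled(&status, &was_cancelled);
    /* If was_cancelled is 0, the message arrived before the cancel took
     * effect and the receive buffer holds valid data. */
}
#endif
```

With a one-hour threshold the application would call
pq_init(&q, 3600.0, cancel_mpi_recv), call pq_add() after each
MPI_Irecv(), and call pq_sweep(&q, time(NULL)) from its main loop.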
Scott