Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] detect hung node
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-04-06 20:35:48


On Apr 6, 2010, at 1:03 PM, Sam Preston wrote:

> I have a problem with the cluster I'm currently using where nodes
> 'hang' silently from time to time during an MPI call. This causes the
> blocked MPI processes to block indefinitely -- the only way to detect
> an error is to notice that no more output is being written to the log
> files. We're trying to resolve the underlying cause of the nodes
> hanging, but in the meantime, is there a way to set a timeout or
> something similar to detect this situation? Sorry if this has been
> addressed before, I searched the FAQ and archives and didn't come up
> with anything.

Unfortunately, no. MPI doesn't actively check to see if an application has deadlocked (although there are tools for doing this kind of thing -- google around for them). It's also possible that something has gone wrong and Open MPI is not detecting it properly. Hopefully, it's not an Open MPI bug!
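
Since the symptom described above is that the log files stop growing, one workaround outside of MPI is a small external watchdog that automates that check. The following is only a rough sketch in Python, not an Open MPI feature: the function names, the timeout, and the polling interval are all made up for illustration, and it assumes each process writes to its log regularly while it is healthy.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds have passed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def find_stalled_logs(paths, timeout_s, now=None):
    """Return the log files that have been quiet for more than `timeout_s` seconds."""
    return [p for p in paths if seconds_since_update(p, now) > timeout_s]


def watch(paths, timeout_s, poll_s=30.0):
    """Poll the logs until at least one goes quiet, then return the stalled paths.

    Hypothetical helper: what to do next (alert someone, kill the job)
    is left to the caller.
    """
    while True:
        stalled = find_stalled_logs(paths, timeout_s)
        if stalled:
            return stalled
        time.sleep(poll_s)
```

You would run something like `watch(["rank0.log", "rank1.log"], timeout_s=600)` from a separate process (the file names here are placeholders) and then alert or kill the job when it returns. Note that a long compute phase with no output looks exactly like a hang to this check, so the timeout has to be longer than the longest quiet period you expect.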

I wish I had more helpful information for you -- let us know what you find about the underlying cause.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/