On Apr 6, 2010, at 1:03 PM, Sam Preston wrote:
> I have a problem with the cluster I'm currently using where nodes
> 'hang' silently from time to time during an MPI call. This causes the
> blocked MPI processes to block indefinitely -- the only way to detect
> an error is to notice that no more output is being written to the log
> files. We're trying to resolve the underlying cause of the nodes
> hanging, but in the mean time, is there a way to set a timeout or
> something similar to detect this situation? Sorry if this has been
> addressed before, I searched the FAQ and archives and didn't come up
> with anything.
Unfortunately, no. MPI doesn't actively check to see if an application has deadlocked (although there are tools for doing this kind of thing -- google around for them). It's also possible that something has gone wrong and Open MPI is not detecting it properly. Hopefully, it's not an Open MPI bug!
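In the meantime, one workaround outside of MPI is to automate the check you're already doing by hand: watch the log files and flag any that stop growing. This is only a rough sketch, not an Open MPI feature. The function names, the timeout, and the idea of one log file per process are my assumptions for illustration, and it only uses the Python standard library.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def find_stalled_logs(paths, timeout_s, now=None):
    """Return the log files not written to for more than `timeout_s` seconds."""
    return [p for p in paths if seconds_since_last_write(p, now) > timeout_s]


def watch(paths, timeout_s, poll_s=30.0):
    """Poll the logs until at least one looks stalled, then return those paths.

    Hypothetical helper: what to do next (kill the job, send mail, ...) is
    left to the caller.
    """
    while True:
        stalled = find_stalled_logs(paths, timeout_s)
        if stalled:
            return stalled
        time.sleep(poll_s)
```

You would run something like watch(["rank0.log", "rank1.log"], timeout_s=600) from a separate script on a node that can see the log files, and kill the job when it returns. Note that a long compute phase that writes no output looks exactly like a hang to this check, so pick the timeout generously, and a missing log file will raise an OSError rather than being reported as stalled.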
I wish I had more helpful information for you -- let us know what you find about the underlying cause.