There are some timeout issues you can see with large clusters on Torque. Check the Torque web site for explanations and instructions on what to do about them. However, that doesn't appear to be the problem here.
If our daemon doesn't report back, it is typically due to one or more of the following reasons:
1. it couldn't start because it didn't find the required libraries.
2. it couldn't report back because it hit a firewall.
3. it couldn't report back because it didn't find a network that would get it back to mpirun.
From your other note, it sounds like #3 might be the problem here. Do you have some nodes that are configured with "eth0" pointing to your 10.x network, and other nodes with "eth0" pointing to your 192.x network? I have found that having interfaces that share a name but sit on different IP networks sometimes causes OMPI to misconnect.
If you randomly got some of those nodes in your allocation, that might explain why your jobs sometimes work and sometimes don't.
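One way to check for that kind of mismatch is to collect the eth0 address from each node and group the nodes by network. The following is only a rough sketch in Python: the hostnames and addresses in ETH0_BY_NODE are made-up placeholders (you would fill them in from something like `ip addr show eth0` on each node), and the helper names are mine, not anything from OMPI or Torque.

```python
import ipaddress
from collections import defaultdict

# Hypothetical data: hostname -> address/prefix seen on eth0 on that node.
# Replace with the values actually reported on your own nodes.
ETH0_BY_NODE = {
    "node01": "10.0.0.11/24",
    "node02": "10.0.0.12/24",
    "node03": "192.168.1.13/24",
}


def group_by_network(addr_by_node):
    """Group hostnames by the IP network their eth0 address belongs to."""
    groups = defaultdict(list)
    for node, cidr in addr_by_node.items():
        # ip_interface keeps the host address; .network gives the enclosing subnet
        network = ipaddress.ip_interface(cidr).network
        groups[network].append(node)
    return dict(groups)


def mixed_networks(addr_by_node):
    """Return True if eth0 is not on the same network on every node."""
    return len(group_by_network(addr_by_node)) > 1


if __name__ == "__main__":
    groups = group_by_network(ETH0_BY_NODE)
    for network in sorted(groups, key=str):
        print(network, sorted(groups[network]))
    if mixed_networks(ETH0_BY_NODE):
        print("eth0 is on different networks across nodes")
```

If the script reports more than one network, any allocation that mixes nodes from different groups would be a candidate for the intermittent failures described above.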
On May 28, 2010, at 3:23 PM, Rahul Nabar wrote:
> On Fri, May 28, 2010 at 3:53 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> What environment are you running on the cluster, and what version of OMPI? Not sure that error message is coming from us.
> The cluster runs PBS-Torque. So I guess, that could be the other error source.