I found the problem. It looks like Xgrid needs more work on fault tolerance. The Xgrid controller appears to distribute jobs to the available agents in a fixed order; if one of the agents has trouble communicating with the controller, all jobs fail, even when other agents are still available.
In my case, the third node the controller always contacted was node6, which was unreachable (I noticed this when I tried to remote-desktop into each node to check it: I could not reach node6 properly, while all the other nodes were fine). After I turned off the agent on node6, the problem went away and everything works fine.
Thank you.
On Oct 4, 2007, at 3:06 PM, Jinhui Qin wrote:
> sib:sharcnet$ mpirun -n 3 ~/openMPI_stuff/Hello
> Process 0.1.1 is unable to reach 0.1.2 for MPI communication.
> If you specified the use of a BTL component, you may have
> forgotten a component (such as "self") in the list of
> usable components.
This is very odd -- it looks like two of the processes don't think
they can talk to each other. Can you try running with:
mpirun -n 3 -mca btl tcp,self <app>
If that fails, then the next piece of information that would be
useful is the IP addresses and netmasks for all the nodes in your
cluster. We have some logic in our TCP communication system that can
cause some interesting results for some network topologies.
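To collect that information, something like the following on each node should work (a sketch for macOS agents; the interface name en0 is an assumption — adjust it to whichever interface carries the cluster traffic):

```shell
# Print the IPv4 address and netmask of the assumed cluster interface (en0).
# On macOS, ifconfig reports the netmask in hex, e.g. "netmask 0xffffff00".
ifconfig en0 | grep 'inet '

# Or list the addresses of every interface at once:
ifconfig -a | grep 'inet '
```

Running this on every node and comparing the subnets is usually enough to spot the kind of topology that confuses the TCP component's reachability logic.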
Just to verify it's not an XGrid problem, you might want to try
running with a hostfile -- I think you'll find that the results are
the same, but it's always good to verify.
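A hostfile run, under assumed hostnames, might look like this (node1 through node3 are placeholders for your actual agent hostnames):

```shell
# Hypothetical hostfile listing the reachable agents; hostnames are assumptions.
cat > myhosts <<'EOF'
node1 slots=1
node2 slots=1
node3 slots=1
EOF

# Launch the same test without Xgrid, forcing the TCP and self BTLs.
mpirun -n 3 --hostfile myhosts -mca btl tcp,self ~/openMPI_stuff/Hello
```

If this succeeds while the Xgrid launch fails, that points at the Xgrid launcher rather than the BTL configuration.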