Thanks for the help. I've replied below.
--- "G.O." <gurhan.ozen_at_[hidden]> wrote:
> 1- Check to make sure that there are no firewalls blocking
> traffic between the nodes.
There is no firewall between the nodes. If I run jobs directly via
ssh, e.g. "ssh node4 env", they work.
> 2 - Check to make sure that all nodes have the openmpi installed
> and have the very same executable you are trying to run on the same
> path, have all permissions correctly.
Yes, they are all installed to /usr/local, the permissions are the
same, and if I log into an individual node and invoke mpirun there,
it works. In fact, even commands like "ssh node4 mpirun" (just to
get the mpirun help banner) work.
> 3- Check to make sure that all nodes have the same interface,
> i.e. eth0 .
They all have the same interfaces. In my configuration, eth1 is
the interface that corresponds to the cluster IP network. I have tried
using "--mca btl_tcp_if_include eth1", but it seems to make no
difference.
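For concreteness, the kind of invocation I mean has roughly this shape.
The process count, the second host name (node3) and the test program
(hostname) are placeholders for illustration, not my exact command line:

```shell
# Restrict the TCP BTL to eth1, the cluster-facing interface.
# "-np 2", "node3" and "hostname" are placeholders, not taken from my setup.
mpirun --mca btl_tcp_if_include eth1 -np 2 --host node3,node4 hostname
```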
> That's all i can think of for very quick checks for now. Hope it's
> one of this.
Thank you very much, but unfortunately it isn't any of these, as far as
I can tell.