Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Cannot run a job with more than 3 nodes
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-12 06:15:19


Are all names resolvable from all servers?

I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?
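
One way to run that check for every pair of nodes is sketched below. It is only an illustration under some assumptions: the function names (parse_hostfile, resolution_checks, run_checks) are made up for this example, it assumes passwordless ssh from the machine running the script to every node, and it assumes `getent` is available on each node (it normally is on Ubuntu). The hostfile format is the one quoted further down in this thread.

```python
import subprocess


def parse_hostfile(text):
    """Return the host names from hostfile text.

    The host name is taken to be the first token on each line; blank
    lines and anything after a '#' are skipped.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])
    return hosts


def resolution_checks(hosts):
    """Build one check per ordered (source, target) pair of distinct hosts.

    Each check logs into `source` over ssh and asks that node's resolver
    for `target` with `getent hosts`, which consults /etc/hosts and DNS.
    """
    return [
        (src, dst, ["ssh", "-o", "BatchMode=yes", src, "getent", "hosts", dst])
        for src in hosts
        for dst in hosts
        if src != dst
    ]


def run_checks(hosts):
    """Run every check and return the (source, target) pairs that failed."""
    failures = []
    for src, dst, cmd in resolution_checks(hosts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((src, dst))
    return failures


# Hypothetical usage, with "hostfile" being the file passed to mpirun:
#   with open("hostfile") as f:
#       for src, dst in run_checks(parse_hostfile(f.read())):
#           print(f"{src} cannot resolve {dst}")
```

If any pair is reported, the /etc/hosts file on the source node of that pair is the first place to look, since an ssh to the failing node may be started from a node other than the one where mpirun was run.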

On Mar 12, 2014, at 4:07 AM, Victor <victor.major_at_[hidden]> wrote:

> Hostnames: no, I actually use lowercase, but for some reason while I was writing the email I thought that uppercase would be clearer.
>
> The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and the executable are shared via nfs.
>
>
> On 12 March 2014 16:01, Reuti <reuti_at_[hidden]> wrote:
> Hi,
>
> On 12.03.2014 at 07:37, Victor wrote:
>
> > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.
> >
> > I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.
> >
> > I can log into each node using ssh with the certificate method from the shell that is running the MPI job, by using their names as defined in /etc/hosts.
> >
> > I can run an mpi job if I include only 3 nodes in the hostfile, for example:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
>
> Are you using an uppercase name here intentionally? Is this the name the host returns from `hostname`? Although uppercase is allowed and should be folded to lowercase (or its case ignored) for hostname resolution, I have found that not all programs do this. In my experience it is best to use only lowercase characters.
>
> Is the same version of Ubuntu Linux installed on all machines?
>
> -- Reuti
>
>
> > But if I add a fourth node to the hostfile, e.g.:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> >
> > I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:
> >
> > ssh: Could not resolve hostname Node4: Name or service not known.
> >
> > But, I can log into Node4 using ssh from the same shell by using ssh Node4.
> >
> > Also, if I reorder the hostfile, for example by moving Node1 to the last spot:
> >
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node1 slots=8 max-slots=8
> >
> > The error becomes
> >
> > ssh: Could not resolve hostname Node1: Name or service not known.
> >
> > If I then go back to the three node hostfile like this:
> >
> > Node1 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> >
> > There is no error with three nodes, even though Node1 and Node4 each "cannot be found" when they are in the last spot of a four-node hostfile. The last entry seems to be the one that fails.
> >
> > What is going on? How do I fix this?
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/