Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Cannot run a job with more than 3 nodes
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-12 06:15:19


Are all names resolvable from all servers?

I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?
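
A rough sketch of that check, assuming passwordless ssh is set up between all nodes and that the hostfile uses the `name slots=N` format quoted below. The function names `hostfile_hosts` and `check_all_pairs` are made up for illustration; they are not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Open MPI.

# Print the host names listed in a hostfile: the first field of every
# line that is neither empty nor a comment.
hostfile_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# From every host in the hostfile, try to ssh to every host in the
# hostfile (including itself) and run `true`.
# A FAILED line means either the source host could not be reached from
# this shell, or the source host could not reach the destination
# (name resolution, keys, or unknown host key).
check_all_pairs() {
    hosts=$(hostfile_hosts "$1")
    for src in $hosts; do
        for dst in $hosts; do
            # BatchMode=yes makes ssh fail instead of prompting.
            if ssh -n -o BatchMode=yes "$src" \
                   ssh -n -o BatchMode=yes "$dst" true >/dev/null 2>&1; then
                echo "$src -> $dst: ok"
            else
                echo "$src -> $dst: FAILED"
            fi
        done
    done
}
```

After sourcing the file, run `check_all_pairs hostfile` from the shell that launches mpirun. Any FAILED pair points at a node that cannot reach another node by name.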

On Mar 12, 2014, at 4:07 AM, Victor <victor.major_at_[hidden]> wrote:

> Hostname: no, I actually use lower case, but for some reason while writing the email I thought upper case would be clearer.
>
> The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and the executable are shared via nfs.
>
>
> On 12 March 2014 16:01, Reuti <reuti_at_[hidden]> wrote:
> Hi,
>
> On 12.03.2014 at 07:37, Victor wrote:
>
> > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.
> >
> > I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.
> >
> > I can log into each node using ssh and the certificate method from the shell that is running the mpi job, by using their names as defined in /etc/hosts.
> >
> > I can run an mpi job if I include only 3 nodes in the hostfile, for example:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
>
> Are you using an uppercase name here intentionally? Is this the name the host returns from `hostname`? Although uppercase is allowed, and case should be folded to lowercase (or ignored) for hostname resolution, I have found that not all programs do this. In my experience it is best to use only lowercase characters.
>
> The same version of your Ubuntu Linux is installed on all machines?
>
> -- Reuti
>
>
> > But if I add a fourth node to the hostfile, e.g.:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> >
> > I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:
> >
> > ssh: Could not resolve hostname Node4: Name or service not known.
> >
> > But, I can log into Node4 using ssh from the same shell by using ssh Node4.
> >
> > Also if I mix up the hostfile like this for example and place Node1 to the last spot:
> >
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node1 slots=8 max-slots=8
> >
> > The error becomes
> >
> > ssh: Could not resolve hostname Node1: Name or service not known.
> >
> > If I then go back to the three node hostfile like this:
> >
> > Node1 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> >
> > There is no error with three nodes, even though both Node1 and Node4 "cannot be found" when they are listed last in a 4-node hostfile. The last entry in the hostfile seems to be bugged.
> >
> > What is going on? How do I fix this?
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/