Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Cannot run a job with more than 3 nodes
From: Victor (victor.major_at_[hidden])
Date: 2014-03-12 04:07:31


Regarding the hostnames: no, I actually use lower case. While I was writing
the email I just thought that upper case would be clearer.

The same version of Ubuntu (12.04 x64) is installed on all nodes, and Open MPI
and the executable are shared via NFS.
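
A minimal sketch of one way to narrow this down, assuming (this is not confirmed anywhere in the thread) that the failing ssh might be started from a node other than the one running mpirun. It reads the hostfile and builds one `ssh <source> getent hosts <target>` command per pair of nodes, so that name resolution can be checked from every node and not only from the launching shell. The function names and the lower-case node names are hypothetical and only for illustration.

```python
import shlex


def parse_hostfile(text):
    """Return the host names from hostfile text, one per non-empty line."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            hosts.append(line.split()[0])  # first field is the host name
    return hosts


def cross_resolution_checks(hosts):
    """Build an ssh command for every ordered (source, target) pair.

    Each command asks the source node to look up the target name with
    `getent hosts`, which consults /etc/hosts and DNS on that node.
    """
    cmds = []
    for src in hosts:
        for dst in hosts:
            if src != dst:
                cmds.append(["ssh", src, "getent", "hosts", dst])
    return cmds


if __name__ == "__main__":
    # Hypothetical hostfile contents, using lower-case names.
    sample = """\
node1 slots=8 max-slots=8
node2 slots=8 max-slots=8
node3 slots=8 max-slots=8
node4 slots=8 max-slots=8
"""
    for cmd in cross_resolution_checks(parse_hostfile(sample)):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Running the printed commands by hand would show whether some node lacks an /etc/hosts entry for another node; any command that prints nothing points at the pair to look at.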

On 12 March 2014 16:01, Reuti <reuti_at_[hidden]> wrote:

> Hi,
>
> Am 12.03.2014 um 07:37 schrieb Victor:
>
> > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd
> > problem.
> >
> > I have 4 nodes, all of which are defined in the hostfile and in
> > /etc/hosts.
> >
> > I can log into each node using ssh with key-based authentication from
> > the shell that is running the mpi job, by using their names as defined
> > in /etc/hosts.
> >
> > I can run an mpi job if I include only 3 nodes in the hostfile, for
> > example:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
>
> You are using an uppercase name here intentionally - is this the one the
> host returns from `hostname`? Although uppercase is allowed and should be
> converted to lowercase, or case should be ignored, during hostname
> resolution, I have found that not all programs do so. In my experience it
> is best to use only lowercase characters.
>
> The same version of your Ubuntu Linux is installed on all machines?
>
> -- Reuti
>
>
> > But if I add a fourth node into the hostfile eg:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> >
> > I get this error after attempting mpirun -np 32 --hostfile hostfile
> > a.out:
> >
> > ssh: Could not resolve hostname Node4: Name or service not known.
> >
> > But I can log into Node4 using ssh from the same shell by using ssh
> > Node4.
> >
> > Also, if I reorder the hostfile, for example like this, and move Node1
> > to the last spot:
> >
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node1 slots=8 max-slots=8
> >
> > The error becomes
> >
> > ssh: Could not resolve hostname Node1: Name or service not known.
> >
> > If I then go back to the three node hostfile like this:
> >
> > Node1 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> >
> > There is no error with three nodes even though both Node1 and Node4
> > "cannot be found" if they are present in a 4 node hostfile in the last
> > spot. The last slot seems to be bugged.
> >
> > What is going on? How do I fix this?
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users