Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun is using one PBS node only
From: Gus Correa (gus_at_[hidden])
Date: 2009-12-01 18:36:15


Hi Belaid Moa

The OpenMPI I install and use is on an NFS-mounted directory.
Hence, all the nodes see the same version, which has "tm" support.

After reading your OpenMPI configuration parameters on the headnode
and worker nodes (and the difference between them),
I would guess (just a guess) that the problem you see is because the
OpenMPI installation on the worker nodes (probably) does not have
Torque support.

However, you should first verify that this is really the case,
because if the OpenMPI configure script
finds the Torque libraries it will (probably) configure and
install OpenMPI with "tm" support, even if you don't ask for it
explicitly on the worker nodes.
Hence, ssh to WN1 or WN2 and run "ompi_info" there to check this first.
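
A minimal sketch of that check, in case it is useful. It assumes the
"tm" components show up in the ompi_info output as "MCA ras: tm" and
"MCA plm: tm" lines, as in the output you posted; has_tm_support is
just a hypothetical helper name, not part of OpenMPI.

```shell
#!/bin/sh
# Sketch only: decide whether an OpenMPI build has Torque ("tm") support
# by scanning ompi_info output for the tm components of the "ras" and
# "plm" frameworks. has_tm_support is a hypothetical helper name.
has_tm_support() {
    # Reads ompi_info output on stdin. Succeeds (exit 0) only if both
    # a "MCA ras: tm" and a "MCA plm: tm" line are present.
    awk '
        /MCA ras: tm / { ras = 1 }
        /MCA plm: tm / { plm = 1 }
        END { exit !(ras && plm) }
    '
}

# Possible usage on each worker node (paths and node names as in this
# thread; adjust to your installation):
#   ssh WN1 /usr/local/bin/ompi_info | has_tm_support && echo "WN1: tm ok"
#   ssh WN2 /usr/local/bin/ompi_info | has_tm_support && echo "WN2: tm ok"
```

Matching on the full "MCA ras: tm" string, rather than a plain
"grep tm", avoids a false positive from the "ptmalloc2" line.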

If there is no Torque on WN1 and WN2 then OpenMPI won't find it
and you won't have "tm" support on the nodes.

In any case, if OpenMPI "tm" support is missing on WN[1,2],
I would suggest that you reinstall OpenMPI on WN1 and WN2 *with tm support*.
This will require that you have Torque on the worker nodes also,
and that you use the same configure command line that you used on the
headnode.

A low-tech alternative is to copy over your OpenMPI directory tree to
the WN1 and WN2 nodes.

A yet simpler alternative is to reinstall OpenMPI on the headnode
in an NFS-mounted directory (as I do here), then
add the corresponding "bin" path to your PATH
and the corresponding "lib" path to your LD_LIBRARY_PATH environment
variable.
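
Something along these lines in your shell startup file would do it.
This is only a sketch: /nfs/openmpi is a made-up prefix standing in for
wherever the NFS-mounted installation actually lives, and OMPI_PREFIX
is a hypothetical variable name.

```shell
# Sketch only: put an NFS-mounted OpenMPI installation first in the
# search paths. /nfs/openmpi is a hypothetical prefix; use your own.
OMPI_PREFIX=${OMPI_PREFIX:-/nfs/openmpi}

# Prepend so this installation wins over any other mpirun/mpiexec
# that may already be on the system.
PATH="$OMPI_PREFIX/bin:$PATH"

# Append the old value only if it was non-empty, to avoid a stray ":".
LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

export PATH LD_LIBRARY_PATH
```

The same settings need to be in effect on the worker nodes too, for
example through a home directory that is shared across the nodes.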

Think about maintenance and upgrades:
with an NFS-mounted directory
you need to install only once, whereas the way you have it now you need
to do it N+1 times (or have a mechanism to propagate a single
installation from the head node to the compute nodes).

NFS is your friend! :)

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Belaid MOA wrote:
> I tried -bynode option but it did not change anything. I also tried the
> "hostname" name command and
> I keep getting only the name of one node repeated according to the -n
> value.
>
> Just to make sure I did the right installation, here is what I did:
>
> -- On the head node (HN), I installed openMPI using the --with-tm option
> as follows:
>
> ./configure --with-tm=/var/spool/torque --enable-static
> make install all
>
> -- On the worker nodes (WN1 and WN2), I installed openMPI without tm
> option as follows (it is a local installation on each worker node):
>
> ./configure --enable-static
> make install all
>
> Is this correct?
>
> Thanks a lot in advance.
> ~Belaid.
> > Date: Tue, 1 Dec 2009 17:07:58 -0500
> > From: gus_at_[hidden]
> > To: users_at_[hidden]
> > Subject: Re: [OMPI users] mpirun is using one PBS node only
> >
> > Hi Belaid Moa
> >
> > Belaid MOA wrote:
> > > Thanks a lot Gus for your help again. I only have one CPU per node.
> > > The -n X option (no matter what the value of X is) shows X processes
> > > running on one node only (the other one is free).
> >
> > So, somehow it is oversubscribing your single processor
> > on the first node.
> >
> > A simple diagnostic:
> >
> > Have you tried to run "hostname" on the two nodes through Torque/PBS
> > and mpiexec?
> >
> > [PBS directives, cd $PBS_O_WORKDIR, etc]
> > ...
> > /full/path/to/openmpi/bin/mpiexec -n 2 hostname
> >
> > Try also with the -byslot and -bynode options.
> >
> >
> > > If I add the machinefile option with WN1 and WN2 in it, the right
> > > behavior is manifested. According to the documentation,
> > > mpirun should get the PBS_NODEFILE automatically from the PBS.
> >
> > Yes, if you compiled the OpenMPI you are using with Torque ("tm") support.
> > Did you?
> > Make sure that it has tm support.
> > Run "ompi_info" with full path if needed, to check that.
> > Are you sure the correct path to what you want is
> > /usr/local/bin/mpirun ?
> > Linux distributions, compilers, and other tools come with their
> > own mpiexec and put it in places that you may not suspect, so better
> > double check that you get what you want.
> > That has been a source of repeated confusion on this and other
> > mailing lists.
> >
> > Also, make sure that passwordless ssh across the nodes is working.
> >
> > Yet another thing to check, for easy name resolution,
> > your /etc/hosts file on *all*
> > nodes including the headnode should
> > have a list of all nodes and their IP addresses.
> > Something like this:
> >
> > 127.0.0.1 localhost.localdomain localhost
> > 192.168.0.1 WN1
> > 192.168.0.2 WN2
> >
> > (The IPs above are guesswork of mine, you know better which to use.)
> >
> > > So, I do
> > > not need to use machinefile.
> > >
> >
> > True, assuming the first condition above (OpenMPI *with* "tm" support).
> >
> > > Any ideas?
> > >
> >
> > Yes, and I sent it to you in my last email!
> > Try the "-bynode" option of mpiexec.
> > ("man mpiexec" is your friend!)
> >
> > > Thanks a lot in advance.
> > > ~Belaid.
> > >
> >
> > Best of luck!
> > Gus Correa
> > ---------------------------------------------------------------------
> > Gustavo Correa
> > Lamont-Doherty Earth Observatory - Columbia University
> > Palisades, NY, 10964-8000 - USA
> > ---------------------------------------------------------------------
> >
> > PS - Your web site link to Paul Krugman is out of date.
> > Here is one to his (active) blog,
> > and another to his (no longer updated) web page: :)
> >
> > http://krugman.blogs.nytimes.com/
> > http://www.princeton.edu/~pkrugman/
> >
> > >
> > > > Date: Tue, 1 Dec 2009 15:42:30 -0500
> > > > From: gus_at_[hidden]
> > > > To: users_at_[hidden]
> > > > Subject: Re: [OMPI users] mpirun is using one PBS node only
> > > >
> > > > Hi Belaid Moa
> > > >
> > > > Belaid MOA wrote:
> > > > > Hi everyone,
> > > > > > Here is another elementary question. I tried the following steps
> > > > > > found in the FAQ section of www.open-mpi.org with a simple hello
> > > > > > world example (with PBS/torque):
> > > > > $ qsub -l nodes=2 my_script.sh
> > > > >
> > > > > my_script.sh is pasted below:
> > > > > ========================
> > > > > #!/bin/sh -l
> > > > > #PBS -N helloTest
> > > > > #PBS -j eo
> > > > > echo `cat $PBS_NODEFILE` # shows two nodes: WN1 WN2
> > > > > cd $PBS_O_WORKDIR
> > > > > /usr/local/bin/mpirun hello
> > > > > ========================
> > > > >
> > > > > > When the job is submitted, only one process is run. When I add the
> > > > > > -n 2 option to the mpirun command,
> > > > > > two processes are run but on one node only.
> > > >
> > > > Do you have a single CPU/core per node?
> > > > Or are they multi-socket/multi-core?
> > > >
> > > > Check "man mpiexec" for the options that control on which nodes and
> > > > slots, etc your program will run.
> > > > ("Man mpiexec" will tell you more than I possibly can.)
> > > >
> > > > The default option is "-byslot",
> > > > which will use all "slots" (actually cores
> > > > or CPUs) available on a node before it moves to the next node.
> > > > Reading your question and your surprise with the result,
> > > > I would guess what you want is "-bynode" (not the default).
> > > >
> > > > Also, if you have more than one CPU/core per node,
> > > > you need to put this information in your Torque/PBS "nodes" file
> > > > (and restart your pbs_server daemon).
> > > > Something like this (for 2 CPUs/cores per node):
> > > >
> > > > WN1 np=2
> > > > WN2 np=2
> > > >
> > > > I hope this helps,
> > > > Gus Correa
> > > > ---------------------------------------------------------------------
> > > > Gustavo Correa
> > > > Lamont-Doherty Earth Observatory - Columbia University
> > > > Palisades, NY, 10964-8000 - USA
> > > > ---------------------------------------------------------------------
> > > >
> > > >
> > > > > > Note that echo `cat $PBS_NODEFILE` outputs
> > > > > > the two nodes I am using: WN1 and WN2.
> > > > >
> > > > > The output from ompi_info is shown below:
> > > > >
> > > > > $ ompi_info | grep tm
> > > > > MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
> > > > > MCA ras: tm (MCA v2.0, API v2.0, Component v1.3.3)
> > > > > MCA plm: tm (MCA v2.0, API v2.0, Component v1.3.3)
> > > > >
> > > > > > Any help on why openMPI/mpirun is using only one PBS node would be
> > > > > > much appreciated.
> > > > >
> > > > > Thanks a lot in advance and sorry for bothering you guys with my
> > > > > elementary questions!
> > > > >
> > > > > ~Belaid.
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> ------------------------------------------------------------------------
> > > > > Windows Live: Keep your friends up to date with what you do online.
> > > > > <http://go.microsoft.com/?linkid=9691810>
> > > > >
> > > > >
> > > > >
> > >
> ------------------------------------------------------------------------
> > > > >
> > > > > _______________________________________________
> > > > > users mailing list
> > > > > users_at_[hidden]
> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >
> > > > _______________________________________________
> > > > users mailing list
> > > > users_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > >
> ------------------------------------------------------------------------
> > > Windows Live: Keep your friends up to date with what you do online.
> > > <http://go.microsoft.com/?linkid=9691810>
> > >
> > >
> > >
> ------------------------------------------------------------------------
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ------------------------------------------------------------------------
> Get a great deal on Windows 7 and see how it works the way you want. See
> the Windows 7 offers now. <http://go.microsoft.com/?linkid=9691813>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users