
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
From: tmishima_at_[hidden]
Date: 2013-03-20 22:05:20


Hi Ralph,

I have the following line in ~/.openmpi/mca-params.conf to use rsh:
orte_rsh_agent = /usr/bin/rsh

I changed this line to:
plm_rsh_agent = /usr/bin/rsh # for openmpi-1.7

Then, the error message disappeared. Thanks.
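For anyone who hits the same deprecation warning: the renamed parameter can be set in several equivalent ways. This is only a sketch; the OMPI_MCA_ environment-variable form and the --mca command-line form are standard Open MPI conventions, and the rsh path follows the one above.

```shell
# Equivalent ways to select the rsh agent in Open MPI 1.7, where the
# parameter was renamed from orte_rsh_agent to plm_rsh_agent.

# 1) Per-user parameter file (~/.openmpi/mca-params.conf):
#        plm_rsh_agent = /usr/bin/rsh

# 2) Environment variable, picked up automatically by mpirun:
export OMPI_MCA_plm_rsh_agent=/usr/bin/rsh

# 3) Command line, overriding the file and the environment for one run:
#        mpirun --mca plm_rsh_agent /usr/bin/rsh -np 4 ./my_program

echo "$OMPI_MCA_plm_rsh_agent"
```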

Returning to the subject, I can rebuild with --enable-debug.
Please wait until it completes.
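For reference, the "modify $PBS_NODEFILE pbs_hosts" step that comes up later in this thread (64 per-core lines condensed to 16 when OMP_NUM_THREADS=4) could look like the sketch below. The awk one-liner is only my guess at what the unshown modify script does, and the node names are made up.

```shell
# Stand-in for the Torque-provided $PBS_NODEFILE, which lists each node
# once per core; in a real job this file already exists.
PBS_NODEFILE=$(mktemp)
for i in 1 2 3 4; do echo node05; done  > "$PBS_NODEFILE"
for i in 1 2 3 4; do echo node06; done >> "$PBS_NODEFILE"

# Keep one hostfile entry per future MPI rank: the first line of each
# group of $OMP_NUM_THREADS consecutive per-core lines. Assumes each
# node's entries are contiguous and a multiple of OMP_NUM_THREADS.
OMP_NUM_THREADS=4
awk -v t="$OMP_NUM_THREADS" 'NR % t == 1' "$PBS_NODEFILE" > pbs_hosts

cat pbs_hosts   # one line per rank: node05, then node06
```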

Regards,
Tetsuya Mishima

> You obviously have some MCA params set somewhere:
>
> > --------------------------------------------------------------------------
> > A deprecated MCA parameter value was specified in an MCA parameter
> > file. Deprecated MCA parameters should be avoided; they may disappear
> > in future releases.
> >
> > Deprecated parameter: orte_rsh_agent
> > --------------------------------------------------------------------------
>
> Check your environment for anything with OMPI_MCA_xxx, and your default
> MCA parameter file to see what has been specified.
>
> The allocation looks okay - I'll have to look for other debug flags you
> can set. Meantime, can you please add --enable-debug to your configure cmd
> line and rebuild?
>
> Thanks
> Ralph
>
>
> On Mar 20, 2013, at 4:39 PM, tmishima_at_[hidden] wrote:
>
> >
> >
> > Hi Ralph,
> >
> > Here is a result of rerun with --display-allocation.
> > I set OMP_NUM_THREADS=1 to make the problem clear.
> >
> > Regards,
> > Tetsuya Mishima
> >
> > P.S. As far as I checked, these 2 cases are OK (no problem).
> > (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation
> >     ~/Ducom/testbed/mPre m02-ld
> > (2) mpirun -v -x OMP_NUM_THREADS --display-allocation
> >     ~/Ducom/testbed/mPre m02-ld
> >
> > Script File:
> >
> > #!/bin/sh
> > #PBS -A tmishima
> > #PBS -N Ducom-run
> > #PBS -j oe
> > #PBS -l nodes=2:ppn=4
> > export OMP_NUM_THREADS=1
> > cd $PBS_O_WORKDIR
> > cp $PBS_NODEFILE pbs_hosts
> > NPROCS=`wc -l < pbs_hosts`
> > mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
> > --display-allocation ~/Ducom/testbed/mPre m02-ld
> >
> > Output:
> >
> > --------------------------------------------------------------------------
> > A deprecated MCA parameter value was specified in an MCA parameter
> > file. Deprecated MCA parameters should be avoided; they may disappear
> > in future releases.
> >
> > Deprecated parameter: orte_rsh_agent
> >
> > --------------------------------------------------------------------------
> >
> > ====================== ALLOCATED NODES ======================
> >
> > Data for node: node06 Num slots: 4 Max slots: 0
> > Data for node: node05 Num slots: 4 Max slots: 0
> >
> > =================================================================
> >
> > --------------------------------------------------------------------------
> > A hostfile was provided that contains at least one node not
> > present in the allocation:
> >
> > hostfile: pbs_hosts
> > node: node06
> >
> > If you are operating in a resource-managed environment, then only
> > nodes that are in the allocation can be used in the hostfile. You
> > may find relative node syntax to be a useful alternative to
> > specifying absolute node names; see the orte_hosts man page for
> > further information.
> >
> > --------------------------------------------------------------------------
> >
> >
> >> I've submitted a patch to fix the Torque launch issue - just some
> >> leftover garbage that existed at the time of the 1.7.0 branch and
> >> didn't get removed.
> >>
> >> For the hostfile issue, I'm stumped as I can't see how the problem
> >> would come about. Could you please rerun your original test and add
> >> "--display-allocation" to your cmd line? Let's see if it is
> >> correctly finding the original allocation.
> >>
> >> Thanks
> >> Ralph
> >>
> >> On Mar 19, 2013, at 5:08 PM, tmishima_at_[hidden] wrote:
> >>
> >>>
> >>>
> >>> Hi Gus,
> >>>
> >>> Thank you for your comments. I understand your advice.
> >>> Our script used to be --npernode type as well.
> >>>
> >>> As I told you before, our cluster consists of nodes having 4, 8,
> >>> and 32 cores, although it was homogeneous in the beginning.
> >>> Furthermore, since the performance of each core is almost the
> >>> same, a mixed use of nodes with different numbers of cores is
> >>> possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
> >>>
> >>> The --npernode approach is not applicable to such a mixed use.
> >>> That's why I'd like to continue using a modified hostfile.
> >>>
> >>> By the way, the problem I reported to Jeff yesterday
> >>> was that something is wrong with openmpi-1.7 under Torque,
> >>> because it caused an error even in a case as simple as the
> >>> one shown below, which surprised me. So the problem is not
> >>> limited to the modified hostfile, I guess.
> >>>
> >>> #PBS -l nodes=4:ppn=8
> >>> mpirun -np 8 ./my_program
> >>> (OMP_NUM_THREADS=4)
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>>> Hi Tetsuya
> >>>>
> >>>> Your script that edits $PBS_NODEFILE into a separate hostfile
> >>>> is very similar to some that I used here for
> >>>> hybrid OpenMP+MPI programs on older versions of OMPI.
> >>>> I haven't tried this in 1.6.X,
> >>>> but it looks like you did and it works also.
> >>>> I haven't tried 1.7 either.
> >>>> Since we run production machines,
> >>>> I try to stick to the stable versions of OMPI (even numbered:
> >>>> 1.6.X, 1.4.X, 1.2.X).
> >>>>
> >>>> I believe you can get the same effect even if you
> >>>> don't edit your $PBS_NODEFILE and let OMPI use it as is,
> >>>> if you carefully choose the values in your
> >>>> #PBS -l nodes=?:ppn=?
> >>>> and your
> >>>> $OMP_NUM_THREADS
> >>>> and use mpiexec with --npernode or --cpus-per-proc.
> >>>>
> >>>> For instance, for twelve MPI processes, with two threads each,
> >>>> on nodes with eight cores each, I would try
> >>>> (but I haven't tried!):
> >>>>
> >>>> #PBS -l nodes=3:ppn=8
> >>>>
> >>>> export OMP_NUM_THREADS=2
> >>>>
> >>>> mpiexec -np 12 -npernode 4
> >>>>
> >>>> or perhaps more tightly:
> >>>>
> >>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
> >>>>
> >>>> I hope this helps,
> >>>> Gus Correa
> >>>>
> >>>>
> >>>>
> >>>> On 03/19/2013 03:12 PM, tmishima_at_[hidden] wrote:
> >>>>>
> >>>>>
> >>>>> Hi Reuti and Gus,
> >>>>>
> >>>>> Thank you for your comments.
> >>>>>
> >>>>> Our cluster is a little bit heterogeneous, with nodes of 4, 8,
> >>>>> and 32 cores. I used 8-core nodes for "-l nodes=4:ppn=8" and
> >>>>> 4-core nodes for "-l nodes=2:ppn=4".
> >>>>> (Strictly speaking, Torque picked up the proper nodes.)
> >>>>>
> >>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no
> >>>>> trouble with that kind of use. I encountered the issue when I was
> >>>>> evaluating openmpi-1.7 to check when we could move on to it,
> >>>>> although we have no positive reason to do that at this moment.
> >>>>>
> >>>>> As Gus pointed out, I use a script file as shown below for a
> >>>>> practical use of openmpi-1.6.x.
> >>>>>
> >>>>> #PBS -l nodes=2:ppn=32   # even "-l nodes=1:ppn=32+4:ppn=8" works fine
> >>>>> export OMP_NUM_THREADS=4
> >>>>> modify $PBS_NODEFILE pbs_hosts   # 64 lines are condensed to 16 lines here
> >>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
> >>>>>        -x OMP_NUM_THREADS ./my_program
> >>>>> # 32-core node has 8 numanodes, 8-core node has 2 numanodes
> >>>>>
> >>>>> It works well under the combination of openmpi-1.6.x and Torque.
> >>>>> The problem is just openmpi-1.7's behavior.
> >>>>>
> >>>>> Regards,
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>>> Hi Tetsuya Mishima
> >>>>>>
> >>>>>> Mpiexec offers you a number of possibilities that you could try:
> >>>>>> --bynode,
> >>>>>> --pernode,
> >>>>>> --npernode,
> >>>>>> --bysocket,
> >>>>>> --bycore,
> >>>>>> --cpus-per-proc,
> >>>>>> --cpus-per-rank,
> >>>>>> --rankfile
> >>>>>> and more.
> >>>>>>
> >>>>>> Most likely one or more of them will fit your needs.
> >>>>>>
> >>>>>> There are also associated flags to bind processes to cores,
> >>>>>> to sockets, etc, to report the bindings, and so on.
> >>>>>>
> >>>>>> Check the mpiexec man page for details.
> >>>>>>
> >>>>>> Nevertheless, I am surprised that modifying the
> >>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
> >>>>>> I have done this many times in older versions of OMPI.
> >>>>>>
> >>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
> >>>>>> or does it lack any special feature that you need?
> >>>>>>
> >>>>>> I hope this helps,
> >>>>>> Gus Correa
> >>>>>>
> >>>>>> On 03/19/2013 03:00 AM, tmishima_at_[hidden] wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Jeff,
> >>>>>>>
> >>>>>>> I didn't have much time to test this morning, so I checked it
> >>>>>>> again now. The trouble seems to depend on the number of nodes
> >>>>>>> used.
> >>>>>>>
> >>>>>>> This works (nodes < 4):
> >>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
> >>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>
> >>>>>>> This causes an error (nodes >= 4):
> >>>>>>> mpiexec -bynode -np 8 ./my_program && #PBS -l nodes=4:ppn=8
> >>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Tetsuya Mishima
> >>>>>>>
> >>>>>>>> Oy; that's weird.
> >>>>>>>>
> >>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why
> >>>>>>>> that is happening -- sorry!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmishima_at_[hidden]> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi Correa and Jeff,
> >>>>>>>>>
> >>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
> >>>>>>>>>
> >>>>>>>>> As a result, my simple example case worked well.
> >>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>> mpiexec -bynode -np 2 ./my_program && #PBS -l nodes=2:ppn=4
> >>>>>>>>>
> >>>>>>>>> But, a practical case where more than 1 process is allocated
> >>>>>>>>> to a node, like below, did not work.
> >>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
> >>>>>>>>>
> >>>>>>>>> The error message is as follows:
> >>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>> attempting to be sent to a process whose contact information is
> >>>>>>>>> unknown in file rml_oob_send.c at line 316
> >>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
> >>>>>>>>> [[30666,0],1]
> >>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>> attempting to be sent to a process whose contact information is
> >>>>>>>>> unknown in file base/grpcomm_base_rollup.c at line 123
> >>>>>>>>>
> >>>>>>>>> Here is our openmpi configuration:
> >>>>>>>>> ./configure \
> >>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> >>>>>>>>> --with-tm \
> >>>>>>>>> --with-verbs \
> >>>>>>>>> --disable-ipv6 \
> >>>>>>>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
> >>>>>>>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> >>>>>>>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> >>>>>>>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Tetsuya Mishima
> >>>>>>>>>
> >>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <gus_at_[hidden]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> In your example, have you tried not to modify the node file,
> >>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
> >>>>>>>>>>> distribution of processes:
> >>>>>>>>>>>
> >>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
> >>>>>>>>>>
> >>>>>>>>>> This should work in 1.7, too (I use these kinds of options
> >>>>>>>>>> with SLURM all the time).
> >>>>>>>>>>
> >>>>>>>>>> However, we should probably verify that the hostfile
> >>>>>>>>>> functionality in batch jobs hasn't been broken in 1.7, too,
> >>>>>>>>>> because I'm pretty sure that what you described should work.
> >>>>>>>>>> However, Ralph, our run-time guy, is on vacation this week.
> >>>>>>>>>> There might be a delay in checking into this.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Jeff Squyres
> >>>>>>>>>> jsquyres_at_[hidden]
> >>>>>>>>>> For corporate legal information go to:
> >>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> users mailing list
> >>>>>>>>>> users_at_[hidden]
> >>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >