
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
From: tmishima_at_[hidden]
Date: 2013-03-20 22:47:35


Hi Ralph,

I have completed the rebuild of openmpi-1.7rc8.
To save time, I added --disable-vt. (Is that OK?)

Well, what shall I do next?

./configure \
--prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
--with-tm \
--with-verbs \
--disable-ipv6 \
--disable-vt \
--enable-debug \
CC=pgcc CFLAGS="-fast -tp k8-64e" \
CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
F77=pgfortran FFLAGS="-fast -tp k8-64e" \
FC=pgfortran FCFLAGS="-fast -tp k8-64e"

Note:
I tried to apply the patch user.diff after rebuilding openmpi-1.7rc8,
but I got an error and could not go forward.

$ patch -p0 < user.diff # this is OK
$ make # I got an error

  CC util/hostfile/hostfile.lo
PGC-S-0037-Syntax error: Recovery attempted by deleting <string>
(util/hostfile/hostfile.c: 728)
PGC/x86-64 Linux 12.9-0: compilation completed with severe errors

Regards,
Tetsuya Mishima

> Could you please apply the attached patch and try it again? If you haven't had time to configure with --enable-debug, that is fine - this will output regardless.
>
> Thanks
> Ralph
>
> - user.diff
>
>
> On Mar 20, 2013, at 4:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> > You obviously have some MCA params set somewhere:
> >
> >> --------------------------------------------------------------------------
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file. Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >> --------------------------------------------------------------------------
> >
> > Check your environment for anything with OMPI_MCA_xxx, and your default MCA parameter file to see what has been specified.
> >
> > The allocation looks okay - I'll have to look for other debug flags you can set. Meantime, can you please add --enable-debug to your configure cmd line and rebuild?
> >
> > Thanks
> > Ralph
> >
> >
> > On Mar 20, 2013, at 4:39 PM, tmishima_at_[hidden] wrote:
> >
> >>
> >>
> >> Hi Ralph,
> >>
> >> Here is the result of a rerun with --display-allocation.
> >> I set OMP_NUM_THREADS=1 to make the problem clear.
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >> P.S. As far as I checked, these 2 cases are OK (no problem).
> >> (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
> >> (2) mpirun -v -x OMP_NUM_THREADS --display-allocation ~/Ducom/testbed/mPre m02-ld
> >>
> >> Script File:
> >>
> >> #!/bin/sh
> >> #PBS -A tmishima
> >> #PBS -N Ducom-run
> >> #PBS -j oe
> >> #PBS -l nodes=2:ppn=4
> >> export OMP_NUM_THREADS=1
> >> cd $PBS_O_WORKDIR
> >> cp $PBS_NODEFILE pbs_hosts
> >> NPROCS=`wc -l < pbs_hosts`
> >> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
> >> --display-allocation ~/Ducom/testbed/mPre m02-ld
> >>
> >> Output:
> >> --------------------------------------------------------------------------
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file. Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >> --------------------------------------------------------------------------
> >>
> >> ====================== ALLOCATED NODES ======================
> >>
> >> Data for node: node06 Num slots: 4 Max slots: 0
> >> Data for node: node05 Num slots: 4 Max slots: 0
> >>
> >> =================================================================
> >> --------------------------------------------------------------------------
> >> A hostfile was provided that contains at least one node not
> >> present in the allocation:
> >>
> >> hostfile: pbs_hosts
> >> node: node06
> >>
> >> If you are operating in a resource-managed environment, then only
> >> nodes that are in the allocation can be used in the hostfile. You
> >> may find relative node syntax to be a useful alternative to
> >> specifying absolute node names - see the orte_hosts man page for
> >> further information.
> >> --------------------------------------------------------------------------
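As an aside, the "relative node syntax" suggested by the warning above is Open MPI's hostfile notation in which +n&lt;i&gt; refers to the i-th node of the resource manager's allocation rather than an absolute hostname (see the orte_hosts man page). A hypothetical, untested sketch of such a hostfile for this 2-node, 4-slot-per-node job:

```shell
# Hypothetical hostfile using relative node syntax: +n0 and +n1 refer to
# the first and second nodes of whatever allocation Torque hands out, so
# the file never names a host outside the allocation.
cat > rel_hosts <<'EOF'
+n0 slots=4
+n1 slots=4
EOF
# then, inside the job: mpirun -np 8 -hostfile rel_hosts ./my_program
```

The hostfile itself stays valid across jobs because no absolute node name ever appears in it.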
> >>
> >>
> >>> I've submitted a patch to fix the Torque launch issue - just some leftover garbage that existed at the time of the 1.7.0 branch and didn't get removed.
> >>>
> >>> For the hostfile issue, I'm stumped as I can't see how the problem would come about. Could you please rerun your original test and add "--display-allocation" to your cmd line? Let's see if it is correctly finding the original allocation.
> >>>
> >>> Thanks
> >>> Ralph
> >>>
> >>> On Mar 19, 2013, at 5:08 PM, tmishima_at_[hidden] wrote:
> >>>
> >>>>
> >>>>
> >>>> Hi Gus,
> >>>>
> >>>> Thank you for your comments. I understand your advice.
> >>>> Our script used to be --npernode type as well.
> >>>>
> >>>> As I mentioned before, our cluster consists of nodes with 4, 8,
> >>>> and 32 cores, although it was homogeneous in the beginning.
> >>>> Furthermore, since the performance of each core is almost the
> >>>> same, a mixed use of nodes with different numbers of cores is
> >>>> possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
> >>>>
> >>>> The --npernode approach is not applicable to such mixed use.
> >>>> That's why I'd like to continue using a modified hostfile.
> >>>>
> >>>> By the way, the problem I reported to Jeff yesterday
> >>>> was that something is wrong with openmpi-1.7 under Torque,
> >>>> because it caused an error even in a case as simple as the
> >>>> one shown below, which surprised me. So I guess the problem
> >>>> is not limited to the modified hostfile.
> >>>>
> >>>> #PBS -l nodes=4:ppn=8
> >>>> mpirun -np 8 ./my_program
> >>>> (OMP_NUM_THREADS=4)
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima
> >>>>
> >>>>> Hi Tetsuya
> >>>>>
> >>>>> Your script that edits $PBS_NODEFILE into a separate hostfile
> >>>>> is very similar to some that I used here for
> >>>>> hybrid OpenMP+MPI programs on older versions of OMPI.
> >>>>> I haven't tried this in 1.6.X,
> >>>>> but it looks like you did and it works also.
> >>>>> I haven't tried 1.7 either.
> >>>>> Since we run production machines,
> >>>>> I try to stick to the stable versions of OMPI (even numbered:
> >>>>> 1.6.X, 1.4.X, 1.2.X).
> >>>>>
> >>>>> I believe you can get the same effect even if you
> >>>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
> >>>>> Say, if you carefully choose the values in your
> >>>>> #PBS -l nodes=?:ppn=?
> >>>>> and your
> >>>>> $OMP_NUM_THREADS
> >>>>> and use mpiexec with --npernode or --cpus-per-proc.
> >>>>>
> >>>>> For instance, for twelve MPI processes, with two threads each,
> >>>>> on nodes with eight cores each, I would try
> >>>>> (but I haven't tried!):
> >>>>>
> >>>>> #PBS -l nodes=3:ppn=8
> >>>>>
> >>>>> export OMP_NUM_THREADS=2
> >>>>>
> >>>>> mpiexec -np 12 -npernode 4
> >>>>>
> >>>>> or perhaps more tightly:
> >>>>>
> >>>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
> >>>>>
> >>>>> I hope this helps,
> >>>>> Gus Correa
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 03/19/2013 03:12 PM, tmishima_at_[hidden] wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Reuti and Gus,
> >>>>>>
> >>>>>> Thank you for your comments.
> >>>>>>
> >>>>>> Our cluster is a little bit heterogeneous; it has nodes with
> >>>>>> 4, 8, and 32 cores.
> >>>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes
> >>>>>> for "-l nodes=2:ppn=4".
> >>>>>> (Strictly speaking, Torque picked the proper nodes.)
> >>>>>>
> >>>>>> As I mentioned before, I usually use openmpi-1.6.x, which has
> >>>>>> no trouble with that kind of use. I encountered the issue when
> >>>>>> I was evaluating openmpi-1.7 to check when we could move on to
> >>>>>> it, although we have no positive reason to do that at this
> >>>>>> moment.
> >>>>>>
> >>>>>> As Gus pointed out, I use a script file as shown below for
> >>>>>> practical use of openmpi-1.6.x.
> >>>>>>
> >>>>>> #PBS -l nodes=2:ppn=32 # even "-l nodes=1:ppn=32+4:ppn=8" works fine
> >>>>>> export OMP_NUM_THREADS=4
> >>>>>> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines here
> >>>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
> >>>>>> -x OMP_NUM_THREADS ./my_program # 32-core node has 8 numanodes, 8-core node has 2 numanodes
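The "modify" step in the script above is site-specific and not shown in the thread. One possible sketch, assuming (as Torque does) that $PBS_NODEFILE lists each host on ppn consecutive lines, is to keep every OMP_NUM_THREADS-th line so that each retained slot line stands for one 4-thread rank (hypothetical, not the author's actual script):

```shell
# Hypothetical condensing step: keep every OMP_NUM_THREADS-th line of the
# Torque nodefile, so e.g. 64 slot lines become 16 hostfile lines when
# each MPI rank runs 4 OpenMP threads. Assumes each host appears on ppn
# consecutive lines, as Torque writes $PBS_NODEFILE.
OMP_NUM_THREADS=4
awk -v n="$OMP_NUM_THREADS" 'NR % n == 1' "$PBS_NODEFILE" > pbs_hosts
```

With nodes=2:ppn=32 and 4 threads per rank, this yields 8 entries per 32-core host, matching "-np 16 -cpus-per-proc 4".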
> >>>>>>
> >>>>>> It works well under the combination of openmpi-1.6.x and
> >>>>>> Torque. The problem is just openmpi-1.7's behavior.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Tetsuya Mishima
> >>>>>>
> >>>>>>> Hi Tetsuya Mishima
> >>>>>>>
> >>>>>>> Mpiexec offers you a number of possibilities that you could try:
> >>>>>>> --bynode,
> >>>>>>> --pernode,
> >>>>>>> --npernode,
> >>>>>>> --bysocket,
> >>>>>>> --bycore,
> >>>>>>> --cpus-per-proc,
> >>>>>>> --cpus-per-rank,
> >>>>>>> --rankfile
> >>>>>>> and more.
> >>>>>>>
> >>>>>>> Most likely one or more of them will fit your needs.
> >>>>>>>
> >>>>>>> There are also associated flags to bind processes to cores,
> >>>>>>> to sockets, etc, to report the bindings, and so on.
> >>>>>>>
> >>>>>>> Check the mpiexec man page for details.
> >>>>>>>
> >>>>>>> Nevertheless, I am surprised that modifying the
> >>>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
> >>>>>>> I have done this many times in older versions of OMPI.
> >>>>>>>
> >>>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
> >>>>>>> or does it lack any special feature that you need?
> >>>>>>>
> >>>>>>> I hope this helps,
> >>>>>>> Gus Correa
> >>>>>>>
> >>>>>>> On 03/19/2013 03:00 AM, tmishima_at_[hidden] wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jeff,
> >>>>>>>>
> >>>>>>>> I didn't have much time to test this morning. So, I checked it
> >>>>>>>> again now. Then, the trouble seems to depend on the number of
> >>>>>>>> nodes to use.
> >>>>>>>>
> >>>>>>>> This works (nodes < 4):
> >>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
> >>>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>>
> >>>>>>>> This causes an error (nodes >= 4):
> >>>>>>>> mpiexec -bynode -np 8 ./my_program && #PBS -l nodes=4:ppn=8
> >>>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Tetsuya Mishima
> >>>>>>>>
> >>>>>>>>> Oy; that's weird.
> >>>>>>>>>
> >>>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why that is happening -- sorry!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmishima_at_[hidden]> wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Correa and Jeff,
> >>>>>>>>>>
> >>>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
> >>>>>>>>>>
> >>>>>>>>>> As a result, my simple example case worked well.
> >>>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>>> mpiexec -bynode -np 2 ./my_program && #PBS -l nodes=2:ppn=4
> >>>>>>>>>>
> >>>>>>>>>> But a practical case, where more than 1 process is allocated
> >>>>>>>>>> to a node as below, did not work.
> >>>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
> >>>>>>>>>>
> >>>>>>>>>> The error message is as follows:
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>>> attempting to be sent to a process whose contact information
> >>>>>>>>>> is unknown in file rml_oob_send.c at line 316
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address
> >>>>>>>>>> for [[30666,0],1]
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>>> attempting to be sent to a process whose contact information
> >>>>>>>>>> is unknown in file base/grpcomm_base_rollup.c at line 123
> >>>>>>>>>>
> >>>>>>>>>> Here is our openmpi configuration:
> >>>>>>>>>> ./configure \
> >>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> >>>>>>>>>> --with-tm \
> >>>>>>>>>> --with-verbs \
> >>>>>>>>>> --disable-ipv6 \
> >>>>>>>>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>
> >>>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <gus_at_[hidden]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> In your example, have you tried not to modify the node file,
> >>>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
> >>>>>>>>>>>> distribution of processes:
> >>>>>>>>>>>>
> >>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
> >>>>>>>>>>>
> >>>>>>>>>>> This should work in 1.7, too (I use these kinds of options with SLURM all the time).
> >>>>>>>>>>>
> >>>>>>>>>>> However, we should probably verify that the hostfile
> >>>>>>>>>>> functionality in batch jobs hasn't been broken in 1.7, too,
> >>>>>>>>>>> because I'm pretty sure that what you described should work.
> >>>>>>>>>>> However, Ralph, our run-time guy, is on vacation this week.
> >>>>>>>>>>> There might be a delay in checking into this.
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Jeff Squyres
> >>>>>>>>>>> jsquyres_at_[hidden]
> >>>>>>>>>>> For corporate legal information go to:
> >>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users