
Subject: Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
From: tmishima_at_[hidden]
Date: 2013-03-20 22:47:35


Hi Ralph,

I have completed the rebuild of openmpi-1.7rc8.
To save time, I added --disable-vt. (Is that OK?)

Well, what should I do next?

./configure \
--prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
--with-tm \
--with-verbs \
--disable-ipv6 \
--disable-vt \
--enable-debug \
CC=pgcc CFLAGS="-fast -tp k8-64e" \
CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
F77=pgfortran FFLAGS="-fast -tp k8-64e" \
FC=pgfortran FCFLAGS="-fast -tp k8-64e"
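
(A quick sanity check that the rebuilt installation really picked up
--enable-debug and left VampirTrace out is to query ompi_info from the
new prefix. The exact label strings vary between Open MPI versions, so
this is only a sketch:

  /home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9/bin/ompi_info | grep -i -e debug -e vampir

Something like "Internal debug support: yes", with no VampirTrace
components listed, would confirm it.)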

Note:
I tried applying the patch user.diff after rebuilding openmpi-1.7rc8.
But I got an error and could not go forward.

$ patch -p0 < user.diff # this is OK
$ make # I got an error

  CC util/hostfile/hostfile.lo
PGC-S-0037-Syntax error: Recovery attempted by deleting <string>
(util/hostfile/hostfile.c: 728)
PGC/x86-64 Linux 12.9-0: compilation completed with severe errors
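
(To narrow this down, it may help to look at what the patch actually put
at the line PGI complains about - just a sketch with standard tools,
assuming the path is relative to the top of the source tree:

  sed -n '720,735p' orte/util/hostfile/hostfile.c

Older PGI compilers such as 12.9 tend to be strict about C89-style code,
so a declaration in the middle of a block or a GCC extension near that
line would be a plausible culprit.)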

Regards,
Tetsuya Mishima

> Could you please apply the attached patch and try it again? If you
> haven't had time to configure with --enable-debug, that is fine - this
> will output regardless.
>
> Thanks
> Ralph
>
> - user.diff
>
>
> On Mar 20, 2013, at 4:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> > You obviously have some MCA params set somewhere:
> >
> >> --------------------------------------------------------------------------
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file. Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >> --------------------------------------------------------------------------
> >
> > Check your environment for anything with OMPI_MCA_xxx, and your default
> > MCA parameter file to see what has been specified.
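
(For reference, one concrete way to do that check - only a sketch, using
the default MCA parameter file locations and the install prefix from the
configure line above - is:

  env | grep OMPI_MCA_
  grep -v '^#' ~/.openmpi/mca-params.conf 2>/dev/null
  grep -v '^#' /home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9/etc/openmpi-mca-params.conf 2>/dev/null

The deprecated orte_rsh_agent setting should show up in one of those
places.)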
> >
> > The allocation looks okay - I'll have to look for other debug flags you
> > can set. Meantime, can you please add --enable-debug to your configure cmd
> > line and rebuild?
> >
> > Thanks
> > Ralph
> >
> >
> > On Mar 20, 2013, at 4:39 PM, tmishima_at_[hidden] wrote:
> >
> >>
> >>
> >> Hi Ralph,
> >>
> >> Here is a result of rerun with --display-allocation.
> >> I set OMP_NUM_THREADS=1 to make the problem clear.
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >> P.S. As far as I checked, these 2 cases are OK (no problem):
> >> (1) mpirun -v -np $NPROCS -x OMP_NUM_THREADS --display-allocation
> >>     ~/Ducom/testbed/mPre m02-ld
> >> (2) mpirun -v -x OMP_NUM_THREADS --display-allocation
> >>     ~/Ducom/testbed/mPre m02-ld
> >>
> >> Script File:
> >>
> >> #!/bin/sh
> >> #PBS -A tmishima
> >> #PBS -N Ducom-run
> >> #PBS -j oe
> >> #PBS -l nodes=2:ppn=4
> >> export OMP_NUM_THREADS=1
> >> cd $PBS_O_WORKDIR
> >> cp $PBS_NODEFILE pbs_hosts
> >> NPROCS=`wc -l < pbs_hosts`
> >> mpirun -v -np $NPROCS -hostfile pbs_hosts -x OMP_NUM_THREADS
> >> --display-allocation ~/Ducom/testbed/mPre m02-ld
> >>
> >> Output:
> >>
> >> --------------------------------------------------------------------------
> >> A deprecated MCA parameter value was specified in an MCA parameter
> >> file. Deprecated MCA parameters should be avoided; they may disappear
> >> in future releases.
> >>
> >> Deprecated parameter: orte_rsh_agent
> >>
> >> --------------------------------------------------------------------------
> >>
> >> ====================== ALLOCATED NODES ======================
> >>
> >> Data for node: node06 Num slots: 4 Max slots: 0
> >> Data for node: node05 Num slots: 4 Max slots: 0
> >>
> >> =================================================================
> >>
> >> --------------------------------------------------------------------------
> >> A hostfile was provided that contains at least one node not
> >> present in the allocation:
> >>
> >> hostfile: pbs_hosts
> >> node: node06
> >>
> >> If you are operating in a resource-managed environment, then only
> >> nodes that are in the allocation can be used in the hostfile. You
> >> may find relative node syntax to be a useful alternative to
> >> specifying absolute node names see the orte_hosts man page for
> >> further information.
> >>
> >> --------------------------------------------------------------------------
> >>
> >>
> >>> I've submitted a patch to fix the Torque launch issue - just some
> >>> leftover garbage that existed at the time of the 1.7.0 branch and
> >>> didn't get removed.
> >>>
> >>> For the hostfile issue, I'm stumped as I can't see how the problem
> >>> would come about. Could you please rerun your original test and add
> >>> "--display-allocation" to your cmd line? Let's see if it is
> >>> correctly finding the original allocation.
> >>>
> >>> Thanks
> >>> Ralph
> >>>
> >>> On Mar 19, 2013, at 5:08 PM, tmishima_at_[hidden] wrote:
> >>>
> >>>>
> >>>>
> >>>> Hi Gus,
> >>>>
> >>>> Thank you for your comments. I understand your advice.
> >>>> Our script used to use the --npernode approach as well.
> >>>>
> >>>> As I told you before, our cluster consists of nodes having 4, 8,
> >>>> and 32 cores, although it used to be homogeneous when it was
> >>>> first built. Furthermore, since the performance of each core
> >>>> is almost the same, a mixed use of nodes with different numbers
> >>>> of cores is possible, just like #PBS -l nodes=1:ppn=32+4:ppn=8.
> >>>>
> >>>> The --npernode approach is not applicable to such mixed use.
> >>>> That's why I'd like to continue to use a modified hostfile.
> >>>>
> >>>> By the way, the problem I reported to Jeff yesterday
> >>>> was that something is wrong with openmpi-1.7 under Torque,
> >>>> because it caused an error even in such a simple case as
> >>>> the one shown below, which surprised me. Now, the problem is not
> >>>> limited to the modified hostfile, I guess.
> >>>>
> >>>> #PBS -l nodes=4:ppn=8
> >>>> mpirun -np 8 ./my_program
> >>>> (OMP_NUM_THREADS=4)
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima
> >>>>
> >>>>> Hi Tetsuya
> >>>>>
> >>>>> Your script that edits $PBS_NODEFILE into a separate hostfile
> >>>>> is very similar to some that I used here for
> >>>>> hybrid OpenMP+MPI programs on older versions of OMPI.
> >>>>> I haven't tried this in 1.6.X,
> >>>>> but it looks like you did and it works also.
> >>>>> I haven't tried 1.7 either.
> >>>>> Since we run production machines,
> >>>>> I try to stick to the stable versions of OMPI (even numbered:
> >>>>> 1.6.X, 1.4.X, 1.2.X).
> >>>>>
> >>>>> I believe you can get the same effect even if you
> >>>>> don't edit your $PBS_NODEFILE and let OMPI use it as is.
> >>>>> Say, if you choose the values of your
> >>>>> #PBS -l nodes=?:ppn=?
> >>>>> and of your
> >>>>> $OMP_NUM_THREADS
> >>>>> carefully, and use mpiexec with --npernode or --cpus-per-proc.
> >>>>>
> >>>>> For instance, for twelve MPI processes, with two threads each,
> >>>>> on nodes with eight cores each, I would try
> >>>>> (but I haven't tried!):
> >>>>>
> >>>>> #PBS -l nodes=3:ppn=8
> >>>>>
> >>>>> export OMP_NUM_THREADS=2
> >>>>>
> >>>>> mpiexec -np 12 -npernode 4
> >>>>>
> >>>>> or perhaps more tightly:
> >>>>>
> >>>>> mpiexec -np 12 --report-bindings --bind-to-core --cpus-per-proc 2
> >>>>>
> >>>>> I hope this helps,
> >>>>> Gus Correa
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 03/19/2013 03:12 PM, tmishima_at_[hidden] wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Reuti and Gus,
> >>>>>>
> >>>>>> Thank you for your comments.
> >>>>>>
> >>>>>> Our cluster is a little bit heterogeneous; it has nodes with 4, 8,
> >>>>>> and 32 cores.
> >>>>>> I used 8-core nodes for "-l nodes=4:ppn=8" and 4-core nodes for
> >>>>>> "-l nodes=2:ppn=4".
> >>>>>> (Strictly speaking, Torque picked the proper nodes.)
> >>>>>>
> >>>>>> As I mentioned before, I usually use openmpi-1.6.x, which has no
> >>>>>> trouble with that kind of use. I encountered the issue when I was
> >>>>>> evaluating openmpi-1.7 to check when we could move on to it,
> >>>>>> although we have no positive reason to do that at this moment.
> >>>>>>
> >>>>>> As Gus pointed out, I use a script file as shown below for a
> >>>>>> practical use of openmpi-1.6.x.
> >>>>>>
> >>>>>> #PBS -l nodes=2:ppn=32  # even "-l nodes=1:ppn=32+4:ppn=8" works fine
> >>>>>> export OMP_NUM_THREADS=4
> >>>>>> modify $PBS_NODEFILE pbs_hosts  # 64 lines are condensed to 16 lines here
> >>>>>> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -report-bindings \
> >>>>>>   -x OMP_NUM_THREADS ./my_program
> >>>>>> # 32-core node has 8 numanodes, 8-core node has 2 numanodes
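
(A side note on the "modify" step in the quoted script above, since that
helper is site-specific: a minimal sketch of one way to condense the
nodefile as described - assuming 4 cores per MPI rank and that
$PBS_NODEFILE lists each host once per core - would be

  # keep every 4th occurrence of each host: 64 lines -> 16 lines
  awk '(seen[$0]++ % 4) == 0' $PBS_NODEFILE > pbs_hosts

though the actual script may of course differ.)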
> >>>>>>
> >>>>>> It works well under the combination of openmpi-1.6.x and Torque.
> >>>>>> The problem is just openmpi-1.7's behavior.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Tetsuya Mishima
> >>>>>>
> >>>>>>> Hi Tetsuya Mishima
> >>>>>>>
> >>>>>>> Mpiexec offers you a number of possibilities that you could try:
> >>>>>>> --bynode,
> >>>>>>> --pernode,
> >>>>>>> --npernode,
> >>>>>>> --bysocket,
> >>>>>>> --bycore,
> >>>>>>> --cpus-per-proc,
> >>>>>>> --cpus-per-rank,
> >>>>>>> --rankfile
> >>>>>>> and more.
> >>>>>>>
> >>>>>>> Most likely one or more of them will fit your needs.
> >>>>>>>
> >>>>>>> There are also associated flags to bind processes to cores,
> >>>>>>> to sockets, etc, to report the bindings, and so on.
> >>>>>>>
> >>>>>>> Check the mpiexec man page for details.
> >>>>>>>
> >>>>>>> Nevertheless, I am surprised that modifying the
> >>>>>>> $PBS_NODEFILE doesn't work for you in OMPI 1.7.
> >>>>>>> I have done this many times in older versions of OMPI.
> >>>>>>>
> >>>>>>> Would it work for you to go back to the stable OMPI 1.6.X,
> >>>>>>> or does it lack any special feature that you need?
> >>>>>>>
> >>>>>>> I hope this helps,
> >>>>>>> Gus Correa
> >>>>>>>
> >>>>>>> On 03/19/2013 03:00 AM, tmishima_at_[hidden] wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jeff,
> >>>>>>>>
> >>>>>>>> I didn't have much time to test this morning, so I checked it again
> >>>>>>>> now. The trouble seems to depend on the number of nodes used.
> >>>>>>>>
> >>>>>>>> This works (nodes < 4):
> >>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
> >>>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>>
> >>>>>>>> This causes an error (nodes >= 4):
> >>>>>>>> mpiexec -bynode -np 8 ./my_program && #PBS -l nodes=4:ppn=8
> >>>>>>>> (OMP_NUM_THREADS=4)
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Tetsuya Mishima
> >>>>>>>>
> >>>>>>>>> Oy; that's weird.
> >>>>>>>>>
> >>>>>>>>> I'm afraid we're going to have to wait for Ralph to answer why
> >>>>>>>>> that is happening -- sorry!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mar 18, 2013, at 4:45 PM, <tmishima_at_[hidden]> wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Correa and Jeff,
> >>>>>>>>>>
> >>>>>>>>>> Thank you for your comments. I quickly checked your suggestion.
> >>>>>>>>>>
> >>>>>>>>>> As a result, my simple example case worked well.
> >>>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>>> mpiexec -bynode -np 2 ./my_program && #PBS -l nodes=2:ppn=4
> >>>>>>>>>>
> >>>>>>>>>> But the practical case below, where more than one process was
> >>>>>>>>>> allocated to a node, did not work:
> >>>>>>>>>> export OMP_NUM_THREADS=4
> >>>>>>>>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
> >>>>>>>>>>
> >>>>>>>>>> The error message is as follows:
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>>> attempting to be sent to a process whose contact information
> >>>>>>>>>> is unknown in file rml_oob_send.c at line 316
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
> >>>>>>>>>> [[30666,0],1]
> >>>>>>>>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
> >>>>>>>>>> attempting to be sent to a process whose contact information
> >>>>>>>>>> is unknown in file base/grpcomm_base_rollup.c at line 123
> >>>>>>>>>>
> >>>>>>>>>> Here is our openmpi configuration:
> >>>>>>>>>> ./configure \
> >>>>>>>>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
> >>>>>>>>>> --with-tm \
> >>>>>>>>>> --with-verbs \
> >>>>>>>>>> --disable-ipv6 \
> >>>>>>>>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
> >>>>>>>>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>
> >>>>>>>>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <gus_at_[hidden]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> In your example, have you tried not to modify the node file,
> >>>>>>>>>>>> launch two mpi processes with mpiexec, and request a "-bynode"
> >>>>>>>>>>>> distribution of processes:
> >>>>>>>>>>>>
> >>>>>>>>>>>> mpiexec -bynode -np 2 ./my_program
> >>>>>>>>>>>
> >>>>>>>>>>> This should work in 1.7, too (I use these kinds of options with
> >>>>>>>>>>> SLURM all the time).
> >>>>>>>>>>>
> >>>>>>>>>>> However, we should probably verify that the hostfile functionality
> >>>>>>>>>>> in batch jobs hasn't been broken in 1.7, too, because I'm pretty
> >>>>>>>>>>> sure that what you described should work. However, Ralph, our
> >>>>>>>>>>> run-time guy, is on vacation this week. There might be a delay in
> >>>>>>>>>>> checking into this.
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Jeff Squyres
> >>>>>>>>>>> jsquyres_at_[hidden]
> >>>>>>>>>>> For corporate legal information go to:
> >>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>>>>>>>>