Guys,

Many thanks for your replies. It has been confirmed by Torque people that the version 2.5.x has this limitation and they are recommending to upgrade it to 4.2.x.

Qamar Nazir

Best Regards,

Qamar Nazir


 

On 05/16/2013 05:52 PM, Gus Correa wrote:
Hi Qamar

I don't have a cluster as large as yours,
but I know Torque requires special settings for large
clusters:

http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml

My tm_.h (Torque 2.4.11) says:

#define TM_ESYSTEM  17000
#define TM_ENOEVENT  17001
#define TM_ENOTCONNECTED 17002

and TM_ESYSTEM may be sent back by pbs_mom (see mom_comm.c)
if it cannot start the user process.

Have you tried to launch a simple "hostname" command with pbsdsh
on >600 nodes?

Diskless/stateless nodes, if you have them, may present another
challenge (say, regarding /tmp):
http://www.supercluster.org/pipermail/torqueusers/2011-March/012453.html
http://www.open-mpi.org/faq/?category=all#poor-sm-btl-performance
http://www.open-mpi.org/faq/?category=all#network-vs-local

I hope this helps,
Gus Correa

On 05/16/2013 12:21 PM, Ralph Castain wrote:
Check the torque error constants - i'm not sure what that value means,
but torque is reporting the error. all we do is print out the value they
return if it is an error


On May 16, 2013, at 9:09 AM, Qamar Nazir <qnazir@ocf.co.uk
<mailto:qnazir@ocf.co.uk>> wrote:

Dear Support,

We are having an issue with our OMPI runs. When we run jobs on <=550
machines (550 x 16 cores) then they work without any problem. As soon
as we run them on 600 or more machines we get the "plm:tm: failed to
spawn daemon, error code = 17000" Error

We are using:

OpenMPI ver: 1.6.4 (Compiled with GCC v4.4.6)
Torque ver: 2.5.12

The ompi_info's output is attached.


The Environmentstats have been pasted below.


Please assist.


env envsubst
[ocfacc@cyan01 fullrun]$ env
MODULE_VERSION_STACK=3.2.10
OMPI_MCA_mtl=^psm
MANPATH=/local/software/openmpi/1.6.4/gcc/share/man:/local/software/moab/6.1.10/man:/usr/local/share/man:/usr/share/man/overrides:/usr/share/man:/local/Modules/default/share/man
HOSTNAME=cyan01
SHELL=/bin/bash
TERM=xterm
HISTSIZE=1000
QTDIR=/usr/lib64/qt-3.3
OLDPWD=/home/ocfacc/hpl/fullrun/results
QTINC=/usr/lib64/qt-3.3/include
LC_ALL=POSIX
USER=ocfacc
LD_LIBRARY_PATH=/local/software/openmpi/1.6.4/gcc/lib:/local/software/torque/default/lib
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;!
 35:*.xcf=0
1;3
5:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
MPIROOT=/local/software/openmpi/1.6.4/gcc
MODULE_VERSION=3.2.10
MAIL=/var/spool/mail/ocfacc
PATH=/local/software/openmpi/1.6.4/gcc/bin:/usr/lib64/qt-3.3/bin:/local/software/moab/6.1.10/sbin:/local/software/moab/6.1.10/bin:/local/software/torque/default/sbin:/local/software/torque/default/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/home/ocfacc/bin:/local/bin:.
PWD=/home/ocfacc/hpl/fullrun
_LMFILES_=/local/Modules/3.2.10/modulefiles/schedulers/torque/2.5.12:/local/Modules/3.2.10/modulefiles/schedulers/moab/6.1.10:/local/Modules/3.2.10/modulefiles/misc/null:/local/Modules/3.2.10/modulefiles/mpi/openmpi/1.6.4/gcc
LANG=en_US.UTF-8
KDE_IS_PRELINKED=1
MOABHOMEDIR=/local/moab/6.1.10
MODULEPATH=/local/Modules/versions:/local/Modules/modulefiles:/local/Modules/3.2.10/modulefiles/misc:/local/Modules/3.2.10/modulefiles/mpi:/local/Modules/3.2.10/modulefiles/libs:/local/Modules/3.2.10/modulefiles/compilers:/local/Modules/3.2.10/modulefiles/apps:/local/Modules/3.2.10/modulefiles/schedulers
LOADEDMODULES=torque/2.5.12:moab/6.1.10:null:openmpi/1.6.4/gcc
KDEDIRS=/usr
PBS_SERVER=blue101,blue102
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HISTCONTROL=ignoredups
SHLVL=1
HOME=/home/ocfacc
LOGNAME=ocfacc
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
LC_CTYPE=POSIX
MODULESHOME=/local/Modules/3.2.10
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
module=() { eval `/local/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
_=/bin/env









--

Best Regards,

*Qamar Nazir*




_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users