Check the Torque error constants. I'm not sure what that value means, but Torque is the one reporting the error. All we do is print out the value Torque returns when it indicates an error.
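
One way to do that lookup is to scan the Torque TM header for the numeric value. The following is only a rough sketch: it assumes the constants are plain integer `#define TM_...` lines in a header such as `tm_.h` under your Torque install's include directory (the path in the comment is a guess based on your LD_LIBRARY_PATH, not something I have verified), and the function names are made up for illustration.

```python
import re

# Matches C preprocessor lines such as "#define TM_SOMETHING 12345".
# Only decimal integer values are considered.
DEFINE_RE = re.compile(r"^\s*#\s*define\s+(TM_\w+)\s+(\d+)\b", re.MULTILINE)


def parse_tm_defines(header_text):
    """Return {numeric value: [macro names]} for every TM_* integer #define."""
    table = {}
    for name, value in DEFINE_RE.findall(header_text):
        table.setdefault(int(value), []).append(name)
    return table


def lookup(header_text, code):
    """Return the TM_* macro names defined as `code`, or an empty list."""
    return parse_tm_defines(header_text).get(code, [])


def lookup_in_file(path, code):
    """Read a header file from `path` and look up `code` in it."""
    with open(path) as fh:
        return lookup(fh.read(), code)


# Example call (hypothetical path, adjust to your installation):
#   lookup_in_file("/local/software/torque/default/include/tm_.h", 17000)
```

Whatever name comes back is Torque's own label for the failure, so the Torque documentation and logs are the place to follow up.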


On May 16, 2013, at 9:09 AM, Qamar Nazir <qnazir@ocf.co.uk> wrote:

Dear Support,

We are having an issue with our OMPI runs. When we run jobs on 550 or fewer machines (550 x 16 cores), they work without any problem. As soon as we run them on 600 or more machines, we get the error "plm:tm: failed to spawn daemon, error code = 17000".

We are using:

OpenMPI ver: 1.6.4 (Compiled with GCC v4.4.6)
Torque ver: 2.5.12

The ompi_info output is attached.


The Environment stats have been pasted below.


Please assist.


[ocfacc@cyan01 fullrun]$ env
MODULE_VERSION_STACK=3.2.10
OMPI_MCA_mtl=^psm
MANPATH=/local/software/openmpi/1.6.4/gcc/share/man:/local/software/moab/6.1.10/man:/usr/local/share/man:/usr/share/man/overrides:/usr/share/man:/local/Modules/default/share/man
HOSTNAME=cyan01
SHELL=/bin/bash
TERM=xterm
HISTSIZE=1000
QTDIR=/usr/lib64/qt-3.3
OLDPWD=/home/ocfacc/hpl/fullrun/results
QTINC=/usr/lib64/qt-3.3/include
LC_ALL=POSIX
USER=ocfacc
LD_LIBRARY_PATH=/local/software/openmpi/1.6.4/gcc/lib:/local/software/torque/default/lib
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
MPIROOT=/local/software/openmpi/1.6.4/gcc
MODULE_VERSION=3.2.10
MAIL=/var/spool/mail/ocfacc
PATH=/local/software/openmpi/1.6.4/gcc/bin:/usr/lib64/qt-3.3/bin:/local/software/moab/6.1.10/sbin:/local/software/moab/6.1.10/bin:/local/software/torque/default/sbin:/local/software/torque/default/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/home/ocfacc/bin:/local/bin:.
PWD=/home/ocfacc/hpl/fullrun
_LMFILES_=/local/Modules/3.2.10/modulefiles/schedulers/torque/2.5.12:/local/Modules/3.2.10/modulefiles/schedulers/moab/6.1.10:/local/Modules/3.2.10/modulefiles/misc/null:/local/Modules/3.2.10/modulefiles/mpi/openmpi/1.6.4/gcc
LANG=en_US.UTF-8
KDE_IS_PRELINKED=1
MOABHOMEDIR=/local/moab/6.1.10
MODULEPATH=/local/Modules/versions:/local/Modules/modulefiles:/local/Modules/3.2.10/modulefiles/misc:/local/Modules/3.2.10/modulefiles/mpi:/local/Modules/3.2.10/modulefiles/libs:/local/Modules/3.2.10/modulefiles/compilers:/local/Modules/3.2.10/modulefiles/apps:/local/Modules/3.2.10/modulefiles/schedulers
LOADEDMODULES=torque/2.5.12:moab/6.1.10:null:openmpi/1.6.4/gcc
KDEDIRS=/usr
PBS_SERVER=blue101,blue102
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HISTCONTROL=ignoredups
SHLVL=1
HOME=/home/ocfacc
LOGNAME=ocfacc
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
LC_CTYPE=POSIX
MODULESHOME=/local/Modules/3.2.10
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
module=() {  eval `/local/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
_=/bin/env

--
Best Regards,

Qamar Nazir

HPC Software Engineer

OCF plc

 

Tel: 0114 257 2200
Fax: 0114 257 0022
Mob: 07508 033895

 

OCF plc is a company registered in England and Wales.  Registered number 4132533. Registered office address: OCF plc, 5 Rotunda Business Centre, Thorncliffe Park, Chapeltown, Sheffield, S35 2PG

 

<ompi_info.txt.bz2>
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users