Open MPI User's Mailing List Archives

Subject: [OMPI users] plm:tm: failed to spawn daemon, error code = 17000 Error when running jobs on 600 or more nodes
From: Qamar Nazir (qnazir_at_[hidden])
Date: 2013-05-16 12:09:19


Dear Support,

We are having an issue with our Open MPI runs. Jobs on 550 or fewer
machines (550 x 16 cores) run without any problem. As soon as we run
them on 600 or more machines, we get the error "plm:tm: failed to spawn
daemon, error code = 17000".
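
For reference, the node and core counts quoted above can be checked from inside a job with a short sketch like the following. It assumes the standard Torque behaviour of listing one hostname per allocated core in the file named by PBS_NODEFILE; the function name summarize_nodefile is hypothetical and is not part of Open MPI or Torque.

```python
import os
from collections import Counter


def summarize_nodefile(path):
    """Return (node_count, slot_count, slots_per_node) for a Torque nodefile.

    Assumes one hostname per line, repeated once per allocated core.
    """
    with open(path) as fh:
        # Skip blank lines so a trailing newline does not count as a host.
        hosts = [line.strip() for line in fh if line.strip()]
    slots_per_node = Counter(hosts)
    return len(slots_per_node), len(hosts), slots_per_node


if __name__ == "__main__":
    # PBS_NODEFILE is only set inside a running Torque job.
    path = os.environ.get("PBS_NODEFILE")
    if path:
        nodes, slots, per_node = summarize_nodefile(path)
        print(f"{nodes} nodes, {slots} slots")
        # Report nodes that did not get the 16 cores quoted in this report.
        for host, count in sorted(per_node.items()):
            if count != 16:
                print(f"  {host}: {count} slots")
    else:
        print("PBS_NODEFILE is not set; run this inside a Torque job")
```

Run inside the failing and the working allocations, this would show whether the jobs really received 600 x 16 and 550 x 16 slots before mpirun tries to launch its daemons.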

We are using:

OpenMPI ver: 1.6.4 (Compiled with GCC v4.4.6)
Torque ver: 2.5.12

The ompi_info output is attached.

The environment variables are pasted below.

Please assist.

[ocfacc_at_cyan01 fullrun]$ env
MODULE_VERSION_STACK=3.2.10
OMPI_MCA_mtl=^psm
MANPATH=/local/software/openmpi/1.6.4/gcc/share/man:/local/software/moab/6.1.10/man:/usr/local/share/man:/usr/share/man/overrides:/usr/share/man:/local/Modules/default/share/man
HOSTNAME=cyan01
SHELL=/bin/bash
TERM=xterm
HISTSIZE=1000
QTDIR=/usr/lib64/qt-3.3
OLDPWD=/home/ocfacc/hpl/fullrun/results
QTINC=/usr/lib64/qt-3.3/include
LC_ALL=POSIX
USER=ocfacc
LD_LIBRARY_PATH=/local/software/openmpi/1.6.4/gcc/lib:/local/software/torque/default/lib
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
MPIROOT=/local/software/openmpi/1.6.4/gcc
MODULE_VERSION=3.2.10
MAIL=/var/spool/mail/ocfacc
PATH=/local/software/openmpi/1.6.4/gcc/bin:/usr/lib64/qt-3.3/bin:/local/software/moab/6.1.10/sbin:/local/software/moab/6.1.10/bin:/local/software/torque/default/sbin:/local/software/torque/default/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/home/ocfacc/bin:/local/bin:.
PWD=/home/ocfacc/hpl/fullrun
_LMFILES_=/local/Modules/3.2.10/modulefiles/schedulers/torque/2.5.12:/local/Modules/3.2.10/modulefiles/schedulers/moab/6.1.10:/local/Modules/3.2.10/modulefiles/misc/null:/local/Modules/3.2.10/modulefiles/mpi/openmpi/1.6.4/gcc
LANG=en_US.UTF-8
KDE_IS_PRELINKED=1
MOABHOMEDIR=/local/moab/6.1.10
MODULEPATH=/local/Modules/versions:/local/Modules/modulefiles:/local/Modules/3.2.10/modulefiles/misc:/local/Modules/3.2.10/modulefiles/mpi:/local/Modules/3.2.10/modulefiles/libs:/local/Modules/3.2.10/modulefiles/compilers:/local/Modules/3.2.10/modulefiles/apps:/local/Modules/3.2.10/modulefiles/schedulers
LOADEDMODULES=torque/2.5.12:moab/6.1.10:null:openmpi/1.6.4/gcc
KDEDIRS=/usr
PBS_SERVER=blue101,blue102
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HISTCONTROL=ignoredups
SHLVL=1
HOME=/home/ocfacc
LOGNAME=ocfacc
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
LC_CTYPE=POSIX
MODULESHOME=/local/Modules/3.2.10
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
module=() { eval `/local/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
_=/bin/env

-- 
Best Regards,
Qamar Nazir
HPC Software Engineer
OCF plc
Tel: 0114 257 2200 | Twitter: <http://twitter.com/ocfplc>
Fax: 0114 257 0022 | Blog: <http://blog.ocf.co.uk/>
Mob: 07508 033895 | Web: <http://www.ocf.co.uk/>