Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] plm:tm: failed to spawn daemon, error code = 17000 Error when running jobs on 600 or more nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-05-16 12:21:28


Check the torque error constants - i'm not sure what that value means, but torque is reporting the error. all we do is print out the value they return if it is an error

On May 16, 2013, at 9:09 AM, Qamar Nazir <qnazir_at_[hidden]> wrote:

> Dear Support,
>
> We are having an issue with our OMPI runs. When we run jobs on <=550 machines (550 x 16 cores) then they work without any problem. As soon as we run them on 600 or more machines we get the "plm:tm: failed to spawn daemon, error code = 17000" Error
>
> We are using:
>
> OpenMPI ver: 1.6.4 (Compiled with GCC v4.4.6)
> Torque ver: 2.5.12
>
> The ompi_info's output is attached.
>
>
> The Environment stats have been pasted below.
>
>
> Please assist.
>
>
> env envsubst
> [ocfacc_at_cyan01 fullrun]$ env
> MODULE_VERSION_STACK=3.2.10
> OMPI_MCA_mtl=^psm
> MANPATH=/local/software/openmpi/1.6.4/gcc/share/man:/local/software/moab/6.1.10/man:/usr/local/share/man:/usr/share/man/overrides:/usr/share/man:/local/Modules/default/share/man
> HOSTNAME=cyan01
> SHELL=/bin/bash
> TERM=xterm
> HISTSIZE=1000
> QTDIR=/usr/lib64/qt-3.3
> OLDPWD=/home/ocfacc/hpl/fullrun/results
> QTINC=/usr/lib64/qt-3.3/include
> LC_ALL=POSIX
> USER=ocfacc
> LD_LIBRARY_PATH=/local/software/openmpi/1.6.4/gcc/lib:/local/software/torque/default/lib
> LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;3 5:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
> MPIROOT=/local/software/openmpi/1.6.4/gcc
> MODULE_VERSION=3.2.10
> MAIL=/var/spool/mail/ocfacc
> PATH=/local/software/openmpi/1.6.4/gcc/bin:/usr/lib64/qt-3.3/bin:/local/software/moab/6.1.10/sbin:/local/software/moab/6.1.10/bin:/local/software/torque/default/sbin:/local/software/torque/default/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/home/ocfacc/bin:/local/bin:.
> PWD=/home/ocfacc/hpl/fullrun
> _LMFILES_=/local/Modules/3.2.10/modulefiles/schedulers/torque/2.5.12:/local/Modules/3.2.10/modulefiles/schedulers/moab/6.1.10:/local/Modules/3.2.10/modulefiles/misc/null:/local/Modules/3.2.10/modulefiles/mpi/openmpi/1.6.4/gcc
> LANG=en_US.UTF-8
> KDE_IS_PRELINKED=1
> MOABHOMEDIR=/local/moab/6.1.10
> MODULEPATH=/local/Modules/versions:/local/Modules/modulefiles:/local/Modules/3.2.10/modulefiles/misc:/local/Modules/3.2.10/modulefiles/mpi:/local/Modules/3.2.10/modulefiles/libs:/local/Modules/3.2.10/modulefiles/compilers:/local/Modules/3.2.10/modulefiles/apps:/local/Modules/3.2.10/modulefiles/schedulers
> LOADEDMODULES=torque/2.5.12:moab/6.1.10:null:openmpi/1.6.4/gcc
> KDEDIRS=/usr
> PBS_SERVER=blue101,blue102
> SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
> HISTCONTROL=ignoredups
> SHLVL=1
> HOME=/home/ocfacc
> LOGNAME=ocfacc
> QTLIB=/usr/lib64/qt-3.3/lib
> CVS_RSH=ssh
> LC_CTYPE=POSIX
> MODULESHOME=/local/Modules/3.2.10
> LESSOPEN=|/usr/bin/lesspipe.sh %s
> G_BROKEN_FILENAMES=1
> module=() { eval `/local/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
> }
> _=/bin/env
>
>
>
>
>
>
>
>
>
> --
>
> Best Regards,
> Qamar Nazir
>
> HPC Software Engineer
>
> OCF plc
>
>
> Tel: 0114 257 2200 Twitter
>
> Fax: 0114 257 0022 Blog
>
> Mob: 07508 033895 Web
>
>
> OCF plc is a company registered in England and Wales. Registered number 4132533. Registered office address: OCF plc, 5 Rotunda Business Centre, Thorncliffe Park, Chapeltown, Sheffield, S35 2PG
>
>
> Please note, any emails relating to an OCF Support request must always be sent to support_at_[hidden] for a ticket number to be generated or existing support ticket to be updated. Should this not be done then OCF cannot be held responsible for requests not dealt with in a timely manner.
>
>
> This message is private and confidential. If you have received this message in error, please notify us immediately and remove it from your system.
>
>
> <ompi_info.txt.bz2>_______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users