Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] plm:tm: failed to spawn daemon, error code = 17000 Error when running jobs on 600 or more nodes
From: Qamar Nazir (qnazir_at_[hidden])
Date: 2013-05-17 06:40:02


Guys,

Many thanks for your replies. The Torque people have confirmed
that version 2.5.x has this limitation, and they recommend
upgrading to 4.2.x.

Best Regards,

*Qamar Nazir*

On 05/16/2013 05:52 PM, Gus Correa wrote:
> Hi Qamar
>
> I don't have a cluster as large as yours,
> but I know Torque requires special settings for large
> clusters:
>
> http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml
>
> My tm_.h (Torque 2.4.11) says:
>
> #define TM_ESYSTEM 17000
> #define TM_ENOEVENT 17001
> #define TM_ENOTCONNECTED 17002
>
> and TM_ESYSTEM may be sent back by pbs_mom (see mom_comm.c)
> if it cannot start the user process.
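For reference, the mapping from the numeric code in the Open MPI message to the constant names quoted above can be sketched in a few lines of Python. This is only an illustration: the table holds just the three values listed from tm_.h in this message, and the function names `tm_error_name` and `decode_plm_tm_message` are hypothetical helpers, not part of Torque or Open MPI.

```python
import re

# TM error constants exactly as quoted from tm_.h (Torque 2.4.11) in this thread.
# Other Torque versions may define more codes; none are assumed here.
TM_ERRORS = {
    17000: "TM_ESYSTEM",
    17001: "TM_ENOEVENT",
    17002: "TM_ENOTCONNECTED",
}


def tm_error_name(code):
    """Return the symbolic name for a TM error code, or a placeholder if unknown."""
    return TM_ERRORS.get(code, "unknown TM error %d" % code)


def decode_plm_tm_message(line):
    """Hypothetical helper: extract the numeric code from a
    'plm:tm: failed to spawn daemon' message and return its TM name.
    Returns None if the line carries no error code."""
    match = re.search(r"error code = (\d+)", line)
    if match is None:
        return None
    return tm_error_name(int(match.group(1)))


if __name__ == "__main__":
    print(decode_plm_tm_message("plm:tm: failed to spawn daemon, error code = 17000"))
```

The name only identifies which constant Torque returned; the underlying cause still has to be found on the Torque side, for example in the pbs_mom logs.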
>
> Have you tried to launch a simple "hostname" command with pbsdsh
> on >600 nodes?
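The pbsdsh test suggested above could be scripted roughly as follows. This is a sketch under assumptions: it has to run from inside a Torque job that was allocated the >600 nodes, it assumes `pbsdsh` is on the PATH, and it uses the `-u` option to run the program once per unique node. The helper names `build_pbsdsh_command` and `run_pbsdsh_hostname` are made up for illustration.

```python
import shutil
import subprocess


def build_pbsdsh_command(program="hostname", unique_nodes=True):
    """Build the pbsdsh argument list; -u asks pbsdsh to run once per node."""
    cmd = ["pbsdsh"]
    if unique_nodes:
        cmd.append("-u")
    cmd.append(program)
    return cmd


def run_pbsdsh_hostname():
    """Run 'hostname' through pbsdsh and return (exit code, unique hosts, stderr).

    Must be called inside a running Torque job; otherwise pbsdsh has no
    allocation to launch on.
    """
    if shutil.which("pbsdsh") is None:
        raise RuntimeError("pbsdsh not found on PATH")
    result = subprocess.run(
        build_pbsdsh_command(), capture_output=True, text=True
    )
    hosts = sorted(set(result.stdout.split()))
    return result.returncode, hosts, result.stderr


if __name__ == "__main__":
    code, hosts, err = run_pbsdsh_hostname()
    print("exit code:", code)
    print("nodes that answered:", len(hosts))
    if err:
        print("stderr:", err)
```

If the number of nodes that answer is smaller than the number requested, or pbsdsh itself fails, the problem is likely in Torque's TM layer rather than in Open MPI.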
>
> Diskless/stateless nodes, if you have them, may present another
> challenge (say, regarding /tmp):
> http://www.supercluster.org/pipermail/torqueusers/2011-March/012453.html
> http://www.open-mpi.org/faq/?category=all#poor-sm-btl-performance
> http://www.open-mpi.org/faq/?category=all#network-vs-local
>
> I hope this helps,
> Gus Correa
>
> On 05/16/2013 12:21 PM, Ralph Castain wrote:
>> Check the Torque error constants - I'm not sure what that value means,
>> but Torque is reporting the error. All we do is print out the value it
>> returns if it is an error.
>>
>>
>> On May 16, 2013, at 9:09 AM, Qamar Nazir <qnazir_at_[hidden]
>> <mailto:qnazir_at_[hidden]>> wrote:
>>
>>> Dear Support,
>>>
>>> We are having an issue with our OMPI runs. Jobs on up to 550
>>> machines (550 x 16 cores) run without any problem. As soon
>>> as we run them on 600 or more machines, we get the error "plm:tm:
>>> failed to spawn daemon, error code = 17000".
>>>
>>> We are using:
>>>
>>> OpenMPI ver: 1.6.4 (Compiled with GCC v4.4.6)
>>> Torque ver: 2.5.12
>>>
>>> The ompi_info output is attached.
>>>
>>>
>>> The environment variables are pasted below.
>>>
>>>
>>> Please assist.
>>>
>>>
>>> env envsubst
>>> [ocfacc_at_cyan01 fullrun]$ env
>>> MODULE_VERSION_STACK=3.2.10
>>> OMPI_MCA_mtl=^psm
>>> MANPATH=/local/software/openmpi/1.6.4/gcc/share/man:/local/software/moab/6.1.10/man:/usr/local/share/man:/usr/share/man/overrides:/usr/share/man:/local/Modules/default/share/man
>>> HOSTNAME=cyan01
>>> SHELL=/bin/bash
>>> TERM=xterm
>>> HISTSIZE=1000
>>> QTDIR=/usr/lib64/qt-3.3
>>> OLDPWD=/home/ocfacc/hpl/fullrun/results
>>> QTINC=/usr/lib64/qt-3.3/include
>>> LC_ALL=POSIX
>>> USER=ocfacc
>>> LD_LIBRARY_PATH=/local/software/openmpi/1.6.4/gcc/lib:/local/software/torque/default/lib
>>> LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
>>> MPIROOT=/local/software/openmpi/1.6.4/gcc
>>> MODULE_VERSION=3.2.10
>>> MAIL=/var/spool/mail/ocfacc
>>> PATH=/local/software/openmpi/1.6.4/gcc/bin:/usr/lib64/qt-3.3/bin:/local/software/moab/6.1.10/sbin:/local/software/moab/6.1.10/bin:/local/software/torque/default/sbin:/local/software/torque/default/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lpp/mmfs/bin:/home/ocfacc/bin:/local/bin:.
>>> PWD=/home/ocfacc/hpl/fullrun
>>> _LMFILES_=/local/Modules/3.2.10/modulefiles/schedulers/torque/2.5.12:/local/Modules/3.2.10/modulefiles/schedulers/moab/6.1.10:/local/Modules/3.2.10/modulefiles/misc/null:/local/Modules/3.2.10/modulefiles/mpi/openmpi/1.6.4/gcc
>>> LANG=en_US.UTF-8
>>> KDE_IS_PRELINKED=1
>>> MOABHOMEDIR=/local/moab/6.1.10
>>> MODULEPATH=/local/Modules/versions:/local/Modules/modulefiles:/local/Modules/3.2.10/modulefiles/misc:/local/Modules/3.2.10/modulefiles/mpi:/local/Modules/3.2.10/modulefiles/libs:/local/Modules/3.2.10/modulefiles/compilers:/local/Modules/3.2.10/modulefiles/apps:/local/Modules/3.2.10/modulefiles/schedulers
>>> LOADEDMODULES=torque/2.5.12:moab/6.1.10:null:openmpi/1.6.4/gcc
>>> KDEDIRS=/usr
>>> PBS_SERVER=blue101,blue102
>>> SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
>>> HISTCONTROL=ignoredups
>>> SHLVL=1
>>> HOME=/home/ocfacc
>>> LOGNAME=ocfacc
>>> QTLIB=/usr/lib64/qt-3.3/lib
>>> CVS_RSH=ssh
>>> LC_CTYPE=POSIX
>>> MODULESHOME=/local/Modules/3.2.10
>>> LESSOPEN=|/usr/bin/lesspipe.sh %s
>>> G_BROKEN_FILENAMES=1
>>> module=() { eval `/local/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
>>> }
>>> _=/bin/env
>>>
>>> --
>>>
>>> Best Regards,
>>>
>>> *Qamar Nazir*
>>>
>>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users