Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI and Torque
From: Randall Svancara (rsvancara_at_[hidden])
Date: 2011-03-21 13:53:19


I am not sure whether any extra configuration is necessary for Torque
to forward the environment. I have included the output of printenv
from an interactive qsub session. I am really at a loss here because I
have never had this much difficulty making Torque run with Open MPI.
Until now it has been mostly a good experience.

Permissions of /tmp:

drwxrwxrwt 4 root root 140 Mar 20 08:57 tmp
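
A listing like the one above shows the mode bits but not whether the current user can actually create a subdirectory there, which is what the session directory needs. A minimal sketch of such a check follows, assuming bash; the function name check_tmpdir and the probe name ompi-probe are made up for illustration and are not part of Open MPI or Torque.

```shell
#!/bin/bash
# check_tmpdir DIR
# Hypothetical helper: reports whether DIR looks usable as a base for
# per-job scratch directories. Prints one status line; returns 1 on failure.
check_tmpdir() {
    local dir="$1"
    [ -d "$dir" ] || { echo "$dir: not a directory"; return 1; }
    [ -w "$dir" ] || { echo "$dir: not writable by $(id -un)"; return 1; }
    # A shared temp dir normally carries the sticky bit (the trailing 't').
    [ -k "$dir" ] || echo "$dir: warning, sticky bit not set"
    # Create and remove a scratch subdirectory to prove it really works.
    local probe
    probe=$(mktemp -d "$dir/ompi-probe.XXXXXX") || {
        echo "$dir: cannot create a subdirectory"
        return 1
    }
    rmdir "$probe"
    echo "$dir: ok"
}
```

Running "check_tmpdir /tmp" on each compute node, followed by "df -h /tmp" to see the free space, would show whether a small or read-only /tmp on a stateless node is a plausible cause.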

mpiexec hostname on a single node:

[rsvancara_at_login1 ~]$ qsub -I -lnodes=1:ppn=12
qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
qsub: job 1667.mgt1.wsuhpc.edu ready

[rsvancara_at_node100 ~]$ mpiexec hostname
node100
node100
node100
node100
node100
node100
node100
node100
node100
node100
node100
node100

mpiexec hostname on two nodes:

[rsvancara_at_node100 ~]$ mpiexec hostname
[node100:09342] plm:tm: failed to poll for a spawned daemon, return status = 17002
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        node99 - daemon did not report back when launched

mpiexec printenv on one node with one CPU:

[rsvancara_at_node164 ~]$ mpiexec printenv
OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
MODULE_VERSION_STACK=3.2.8
MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
HOSTNAME=node164
PBS_VERSION=TORQUE-2.4.7
TERM=xterm
SHELL=/bin/bash
HISTSIZE=1000
PBS_JOBNAME=STDIN
PBS_ENVIRONMENT=PBS_INTERACTIVE
PBS_O_WORKDIR=/home/admins/rsvancara
PBS_TASKNUM=1
USER=rsvancara
LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
PBS_O_HOME=/home/admins/rsvancara
CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
PBS_MOMPORT=15003
PBS_O_QUEUE=batch
NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
MODULE_VERSION=3.2.8
MAIL=/var/spool/mail/rsvancara
PBS_O_LOGNAME=rsvancara
PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
PBS_O_LANG=en_US.UTF-8
PBS_JOBCOOKIE=D52DE562B685A462849C1136D6B581F9
INPUTRC=/etc/inputrc
PWD=/home/admins/rsvancara
_LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
PBS_NODENUM=0
LANG=C
MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
PBS_O_SHELL=/bin/bash
PBS_SERVER=mgt1.wsuhpc.edu
PBS_JOBID=1670.mgt1.wsuhpc.edu
SHLVL=1
HOME=/home/admins/rsvancara
INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
PBS_O_HOST=login1
DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
PBS_VNODENUM=0
LOGNAME=rsvancara
PBS_QUEUE=batch
MODULESHOME=/home/software/mpi/intel/openmpi-1.4.3
LESSOPEN=|/usr/bin/lesspipe.sh %s
PBS_O_MAIL=/var/spool/mail/rsvancara
G_BROKEN_FILENAMES=1
PBS_NODEFILE=/var/spool/torque/aux//1670.mgt1.wsuhpc.edu
PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
module=() { eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
_=/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec
OMPI_MCA_orte_local_daemon_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
OMPI_MCA_orte_hnp_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=1
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=1
OMPI_COMM_WORLD_SIZE=1
OMPI_COMM_WORLD_LOCAL_SIZE=1
OMPI_MCA_orte_ess_jobid=3236233217
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OPAL_OUTPUT_STDERR_FD=19
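
To tell whether the environment really is forwarded, a dump like the one above can be compared with one captured on a remote node. The following is a rough sketch, assuming bash and two saved printenv outputs; compare_env and the file names local.env and remote.env are hypothetical, and how the remote dump is captured (for example through the rsh launcher, which does work here) is left open.

```shell
#!/bin/bash
# compare_env LOCAL_DUMP REMOTE_DUMP [VAR...]
# Hypothetical helper: compares selected variables between two saved
# printenv outputs. Defaults to PATH and LD_LIBRARY_PATH.
# Returns 1 if any of the variables differ.
compare_env() {
    local a="$1" b="$2"
    shift 2
    local vars=("$@")
    [ ${#vars[@]} -gt 0 ] || vars=(PATH LD_LIBRARY_PATH)
    local v va vb status=0
    for v in "${vars[@]}"; do
        # Take the first line starting with VAR= and strip the prefix.
        va=$(sed -n "/^$v=/{s/^$v=//p;q;}" "$a")
        vb=$(sed -n "/^$v=/{s/^$v=//p;q;}" "$b")
        if [ "$va" = "$vb" ]; then
            echo "$v: same"
        else
            echo "$v: DIFFERS"
            status=1
        fi
    done
    return $status
}
```

For example, "compare_env local.env remote.env" prints one line per variable; a DIFFERS line for LD_LIBRARY_PATH would point at the missing-library explanation given in the error message.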

mpiexec with -mca plm rsh:

[rsvancara_at_node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdir_base /fastscratch/admins/tmp hostname
node164
node164
node164
node164
node164
node164
node164
node164
node164
node164
node164
node164
node163
node163
node163
node163
node163
node163
node163
node163
node163
node163
node163
node163
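
With this many lines of output it is easy to miscount, so a small tally helps confirm that each node ran the expected number of processes. This is only a sketch, assuming bash; count_by_host is a made-up name. The same tally can be applied to the file named by $PBS_NODEFILE, which Torque fills with one line per allocated slot.

```shell
#!/bin/bash
# count_by_host [FILE...]
# Hypothetical helper: reads hostnames (one per line) from the given
# files or from stdin and prints "host count" pairs sorted by host.
count_by_host() {
    sort "$@" | uniq -c | awk '{print $2, $1}'
}
```

For example, "mpiexec -mca plm rsh hostname | count_by_host" can be compared against "count_by_host "$PBS_NODEFILE"" to check that the launch matches the allocation.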

On Mon, Mar 21, 2011 at 9:22 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> Can you run anything under TM? Try running "hostname" directly from Torque to see if anything works at all.
>
> The error message is telling you that the Torque daemon on the remote node reported a failure when trying to launch the OMPI daemon. Could be that Torque isn't set up to forward environments so the OMPI daemon isn't finding required libs. You could directly run "printenv" to see how your remote environ is being set up.
>
> Could be that the tmp dir lacks correct permissions for a user to create the required directories. The OMPI daemon tries to create a session directory in the tmp dir, so failure to do so would indeed cause the launch to fail. You can specify the tmp dir with a cmd line option to mpirun. See "mpirun -h" for info.
>
>
> On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:
>
>> I have a question about using OpenMPI and Torque on stateless nodes.
>> I have compiled openmpi 1.4.3 with --with-tm=/usr/local
>> --without-slurm using intel compiler version 11.1.075.
>>
>> When I run a simple "hello world" mpi program, I am receiving the
>> following error.
>>
>> [node164:11193] plm:tm: failed to poll for a spawned daemon, return status = 17002
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>>         node163 - daemon did not report back when launched
>>         node159 - daemon did not report back when launched
>>         node158 - daemon did not report back when launched
>>         node157 - daemon did not report back when launched
>>         node156 - daemon did not report back when launched
>>         node155 - daemon did not report back when launched
>>         node154 - daemon did not report back when launched
>>         node152 - daemon did not report back when launched
>>         node151 - daemon did not report back when launched
>>         node150 - daemon did not report back when launched
>>         node149 - daemon did not report back when launched
>>
>>
>> But if I include:
>>
>> -mca plm rsh
>>
>> The job runs just fine.
>>
>> I am not sure what the problem is with torque or openmpi that prevents
>> the process from launching on remote nodes. I have posted to the
>> torque list and someone suggested that a lack of temporary directory
>> space may be causing the issue. I have 100MB allocated to /tmp.
>>
>> Any ideas as to why I am having this problem would be appreciated.
>>
>>
>> --
>> Randall Svancara
>> http://knowyourlinux.com/
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

-- 
Randall Svancara
http://knowyourlinux.com/