There was a bug in early Torque 2.1.x versions (I'm afraid I don't remember
which one) that -- I think -- had something to do with a faulty poll()
implementation. Whatever the problem was, it caused all TM launchers to
fail on OS X.
Can you see if the Torque-included tool pbsdsh works properly? It uses the
same API that Open MPI does (the "tm" API).
If pbsdsh fails, I suspect you're looking at a Torque bug. I
know that Garrick S. has since fixed the problem in the Torque code
base; I don't know if they've had a release since then that included the
fix.
If pbsdsh works, let us know and we'll track this down further.
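For what it's worth, the pbsdsh check can be wrapped in a few lines of shell. This is only a rough sketch: check_tm is a made-up helper name (it is not part of Torque or Open MPI), and it assumes it is run from inside a Torque job, where PBS_JOBID is set.

```shell
#!/bin/sh
# Sketch only: check_tm is a hypothetical helper, not a Torque or Open MPI tool.
# It is meant to be called from inside a Torque job (interactive or batch),
# because pbsdsh launches its tasks through the TM API.

check_tm() {
    launcher="${1:-pbsdsh}"   # command to test; defaults to Torque's pbsdsh

    # Torque sets PBS_JOBID inside a job; without it there is nothing to test.
    if [ -z "${PBS_JOBID:-}" ]; then
        echo "not inside a Torque job (PBS_JOBID is unset)"
        return 3
    fi

    # Make sure the launcher is actually on PATH before trying it.
    if ! command -v "$launcher" >/dev/null 2>&1; then
        echo "$launcher not found in PATH"
        return 2
    fi

    # Launch a trivial command and report whether the launcher succeeded.
    if "$launcher" /bin/hostname; then
        echo "TM launch via $launcher succeeded"
    else
        echo "TM launch via $launcher failed"
        return 1
    fi
}
```

From an interactive job (for example one started with qsub -I), source the snippet and call check_tm with no arguments. A failure here, with mpirun out of the picture, would point at Torque rather than Open MPI.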
I'm having trouble getting Open MPI to execute jobs when submitting
through Torque. Everything works fine if I use the included mpirun
scripts, but this is obviously not a good solution for the general users
on the cluster.
I'm running under OS X 10.4, Darwin 8.6.0. I configured Open MPI with:
export CC=/opt/ibmcmp/vac/6.0/bin/xlc
export CXX=/opt/ibmcmp/vacpp/6.0/bin/xlc++
export FC=/opt/ibmcmp/xlf/8.1/bin/xlf90_r
export F77=/opt/ibmcmp/xlf/8.1/bin/xlf_r
export LDFLAGS=-lSystemStubs
export LIBTOOL=glibtool
PREFIX=/usr/local/ompi-xl
./configure \
    --prefix=$PREFIX \
    --with-tm=/usr/local/pbs/ \
    --with-gm=/opt/gm \
    --enable-static \
    --disable-cxx
I also had to employ the fix listed in:
http://www.open-mpi.org/community/lists/users/2006/04/1007.php
I've attached the output of ompi_info while in an interactive job. Looking
through the list, I can at least save a bit of trouble by listing what does
work. Anything outside of Torque seems fine. From within an interactive
job, pbsdsh works fine, so the earlier problems with poll are fixed.
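For anyone repeating this check, one quick way to see whether a build picked up Torque support is to filter the ompi_info output for tm components. This is a minimal sketch: list_tm_components is a hypothetical helper name, and it assumes ompi_info prints component lines in the form "MCA <framework>: <name> (...)" with tm components under the ras and pls frameworks.

```shell
#!/bin/sh
# Sketch only: list_tm_components is a hypothetical helper name.
# Assumption: ompi_info prints lines such as "MCA pls: tm (...)"; adjust the
# pattern if your version formats its component list differently.

list_tm_components() {
    # Read ompi_info output on stdin and keep only the tm component lines.
    # The exit status is non-zero when no tm component is listed.
    grep -E 'MCA (ras|pls): tm( |$)'
}
```

Used as "ompi_info | list_tm_components". An empty result would suggest that --with-tm did not take effect at configure time.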
Here is the error that is reported when I attempt to run hostname on one
processor:
node96:/usr/src/openmpi-1.1 jbronder$ /usr/local/ompi-xl/bin/mpirun -np 1 -mca pls_tm_debug 1 /bin/hostname
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: final top-level argv:
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 2 --vpid_start 0 --nodename --universe jbronder@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: Set prefix:/usr/local/ompi-xl
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: launching on node localhost
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: resetting PATH: /usr/local/ompi-xl/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/pbs/bin:/usr/local/mpiexec/bin:/opt/ibmcmp/xlf/8.1/bin:/opt/ibmcmp/vac/6.0/bin:/opt/ibmcmp/vacpp/6.0/bin:/opt/gm/bin:/opt/fms/bin
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: found /usr/local/ompi-xl/bin/orted
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe jbronder@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: start_procs returned error -13
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found in file rmgr_urm.c at line 184
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found in file rmgr_urm.c at line 432
[node96.meldrew.clusters.umaine.edu:00850] mpirun: spawn failed with errno=-13
node96:/usr/src/openmpi-1.1 jbronder$
My thanks for any help in advance,
Justin Bronder.