
Open MPI User's Mailing List Archives


From: Justin Bronder (jsbronder_at_[hidden])
Date: 2006-06-30 09:07:20


Greetings,

The bug with poll() was fixed in the stable Torque 2.1.1 release, and I have
checked again to confirm that pbsdsh works:

jbronder_at_meldrew-linux ~/src/hpl $ qsub -I -q default -l nodes=4:ppn=2 -l opsys=darwin
qsub: waiting for job 312.ldap1.meldrew.clusters.umaine.edu to start
qsub: job 312.ldap1.meldrew.clusters.umaine.edu ready

node96:~ jbronder$ pbsdsh uname -a
Darwin node96.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node96.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node94.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node94.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node95.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node95.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node93.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node93.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
node96:~ jbronder$

If there is anything else I should check, please let me know.

Thanks,

Justin Bronder.

On 6/30/06, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>
> There was a bug in early Torque 2.1.x versions (I'm afraid I don't
> remember which one) that -- I think -- had something to do with a faulty
> poll() implementation. Whatever the problem was, it caused all TM launchers
> to fail on OSX.
>
> Can you see if the Torque-included tool pbsdsh works properly? It uses
> the same API that Open MPI does (the "tm" api).
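>
> (If you want to poke at the TM API directly, here is a rough sketch that
> exercises the same calls from inside a job. This is untested, and the
> include/library paths and the -ltorque name depend on your Torque install:)
>
> /* tm_probe.c: ask the TM interface how many task slots the job has. */
> #include <stdio.h>
> #include <tm.h>
>
> int main(void)
> {
>     struct tm_roots roots;
>     tm_node_id *nodes;
>     int nnodes;
>
>     /* tm_init() only succeeds when run from inside a Torque job. */
>     if (tm_init(NULL, &roots) != TM_SUCCESS) {
>         fprintf(stderr, "tm_init failed -- not inside a job?\n");
>         return 1;
>     }
>     /* tm_nodeinfo() returns one entry per virtual processor. */
>     if (tm_nodeinfo(&nodes, &nnodes) != TM_SUCCESS) {
>         fprintf(stderr, "tm_nodeinfo failed\n");
>         tm_finalize();
>         return 1;
>     }
>     printf("TM reports %d task slots in this job\n", nnodes);
>     tm_finalize();
>     return 0;
> }
>
> Compile with something like "cc tm_probe.c -I/usr/local/pbs/include
> -L/usr/local/pbs/lib -ltorque" and run it from an interactive job; if the
> count matches your -l nodes request, TM itself is answering.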
>
> If pbsdsh fails, I suspect you're looking at a Torque bug. I know
> that Garrick S. has since fixed the problem in the Torque code base; I don't
> know if they've had a release since then that included the fix.
>
> If pbsdsh works, let us know and we'll track this down further.
>
> ------------------------------
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Justin Bronder
> Sent: Thursday, June 29, 2006 5:19 PM
> To: users_at_[hidden]
> Subject: [OMPI users] OpenMpi 1.1 and Torque 2.1.1
>
> I'm having trouble getting Open MPI to execute jobs when submitting through
> Torque. Everything works fine if I use the included mpirun scripts, but this
> is obviously not a good solution for the general users on the cluster.
>
> I'm running under OS X 10.4, Darwin 8.6.0. I configured Open MPI with:
> export CC=/opt/ibmcmp/vac/6.0/bin/xlc
> export CXX=/opt/ibmcmp/vacpp/6.0/bin/xlc++
> export FC=/opt/ibmcmp/xlf/8.1/bin/xlf90_r
> export F77=/opt/ibmcmp/xlf/8.1/bin/xlf_r
> export LDFLAGS=-lSystemStubs
> export LIBTOOL=glibtool
>
> PREFIX=/usr/local/ompi-xl
>
> ./configure \
> --prefix=$PREFIX \
> --with-tm=/usr/local/pbs/ \
> --with-gm=/opt/gm \
> --enable-static \
> --disable-cxx
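>
> (As a sanity check on a build like this, running the installed ompi_info,
> e.g. "/usr/local/ompi-xl/bin/ompi_info | grep tm", should list the tm
> launcher components; if nothing tm-related shows up, the --with-tm path is
> the first thing to revisit.)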
>
> I also had to employ the fix listed in:
> http://www.open-mpi.org/community/lists/users/2006/04/1007.php
>
>
> I've attached the output of ompi_info from within an interactive job. Looking
> through the list, I can at least save a bit of trouble by listing what does
> work: anything outside of Torque seems fine. From within an interactive job,
> pbsdsh works fine, hence the earlier problems with poll() are fixed.
>
> Here is the error reported when I attempt to run hostname on one processor:
> node96:/usr/src/openmpi-1.1 jbronder$ /usr/local/ompi-xl/bin/mpirun -np 1
> -mca pls_tm_debug 1 /bin/hostname
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: final top-level argv:
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: orted
> --no-daemonize --bootproxy 1 --name --num_procs 2 --vpid_start 0
> --nodename --universe jbronder_at_[hidden]:default-universe
> --nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0
> ;tcp://10.0.1.96:49395"
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: Set
> prefix:/usr/local/ompi-xl
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: launching on node
> localhost
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: resetting PATH:
> /usr/local/ompi-xl/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/pbs/bin:/usr/local/mpiexec/bin:/opt/ibmcmp/xlf/8.1/bin:/opt/ibmcmp/vac/6.0/bin:/opt/ibmcmp/vacpp/6.0/bin:/opt/gm/bin:/opt/fms/bin
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: found
> /usr/local/ompi-xl/bin/orted
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: not oversubscribed --
> setting mpi_yield_when_idle to 0
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: executing: orted
> --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
> --nodename localhost --universe
> jbronder_at_[hidden]:default-universe --nsreplica "
> 0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
> [node96.meldrew.clusters.umaine.edu:00850] pls:tm: start_procs returned
> error -13
> [node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not
> found in file rmgr_urm.c at line 184
> [node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not
> found in file rmgr_urm.c at line 432
> [node96.meldrew.clusters.umaine.edu:00850] mpirun: spawn failed with
> errno=-13
> node96:/usr/src/openmpi-1.1 jbronder$
>
>
> My thanks in advance for any help,
>
> Justin Bronder.