Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for start
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-10-11 07:56:02


I don't know much/anything about SGE (I'll leave that to the Sun folks
on this list to reply), but I can tell you about the tm plugins: tm is
the protocol used by the PBS/Torque family of launchers. It looks
like your Open MPI was built with TM support, but when you launch,
it's likely unable to find the support libraries that it needs to load
those plugins.

This is probably fine in your case, since you want to use SGE, not TM.
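
To confirm that tm is the only component failing to load, the error stream can be filtered for those "unable to open" messages. The following is a minimal sketch, not part of Open MPI: the function name missing_components is hypothetical, and it assumes the messages have the same "component_find: unable to open <framework> <component>:" shape as the ones quoted below.

```shell
#!/bin/bash
# Hypothetical helper (not an Open MPI tool): read an mpirun error log on
# stdin and print each unique "<framework> <component>" pair that
# component_find reported it could not open.
missing_components() {
    # Keep only lines of the form
    #   ... component_find: unable to open <framework> <component>: ...
    # and reduce each one to "<framework> <component>".
    sed -n 's/.*component_find: unable to open \([^ ]*\) \([^ :]*\):.*/\1 \2/p' |
        sort -u
}

# Assumption: if the warnings are unwanted, MCA component exclusion with the
# "^" prefix should skip the tm plugins for the two frameworks named in the
# log, for example:
#   mpirun --mca pls ^tm --mca ras ^tm -np $NSLOTS hostname
```

Usage would look like `missing_components < junksub.sh.e3574`. If the only pairs printed are for tm, the warnings are unrelated to the SGE launch failure.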

On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:

> I am relatively new to Open MPI and Sun Grid Engine parallel
> integration. I have a small cluster running SGE 6.2 on Linux
> machines, all using Intel Xeon processors. I have installed Open MPI
> 1.2.7 from source using the --with-sge switch. Now I am trying to
> troubleshoot some problems I am having. I have created a simple job
> script:
>
> #!/bin/bash
> #$ -S /bin/bash
> #$ -cwd
> mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname
>
> And the output on the error stream:
>> more junksub.sh.e3574
> [shakespeare:05720] mca: base: component_find: unable to open ras tm:
> file not found (ignored)
> [shakespeare:05720] mca: base: component_find: unable to open pls tm:
> file not found (ignored)
> Starting server daemon at host "shakespeare.nci.nih.gov"
> Starting server daemon at host "octopus.nci.nih.gov"
> Server daemon successfully started with task id "1.shakespeare"
> [shakespeare:05733] mca: base: component_find: unable to open ras tm:
> file not found (ignored)
> [shakespeare:05733] mca: base: component_find: unable to open pls tm:
> file not found (ignored)
> error: executing task of job 3576 failed: failed sending task to
> execd_at_[hidden]: can't find connection
> [shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed
> to start as expected.
> [shakespeare:05720] ERROR: There may be more information available
> from
> [shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine
> tasks.
> [shakespeare:05720] ERROR: If the problem persists, please restart the
> [shakespeare:05720] ERROR: Grid Engine PE job
> [shakespeare:05720] ERROR: The daemon exited unexpectedly with
> status 1.
>
> However, there is no output in any output stream.
>
> And if I log into shakespeare and run qrsh -q all.q@octopus, I immediately
> get a slot, so there isn't a "direct" problem with connecting.
>
> As I got a hint from folks on the SGE mailing list, it appears that
> qrsh is not being used for job submission. Any suggestions as to why
> this might be the case (or if it is the case)?
>
> Thanks,
> Sean

-- 
Jeff Squyres
Cisco Systems