Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for start
From: Reuti (reuti_at_[hidden])
Date: 2008-10-11 18:48:04


On 12.10.2008 at 00:21, Sean Davis wrote:

> <snip>
>
> Thanks, Pak. There is only one queue on the SGE system. Of course,
> there are queue instances for each machine, which is usual for SGE.
>
> I'll give the -masterq a look. And the messages files for the
> involved machines are devoid of anything useful; in fact, there is no
> mention of these jobs, in general.

Hi,

to see more, you can set "loglevel log_info" in the global cluster
configuration (qconf -mconf).

Do you have more than one network card installed, and did you give
them the same name?
Is your defined "tmpdir" local on each machine?
Do you redefine $TMPDIR in your .bashrc or anything else therein?
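
A quick way to check the last question is to search the shell startup files for TMPDIR assignments. The following is only a minimal sketch: find_tmpdir_overrides is a hypothetical helper (not part of SGE or Open MPI), and it assumes plain "TMPDIR=..." or "export TMPDIR=..." lines rather than anything set indirectly through sourced files.

```shell
# Hypothetical helper: print every line in the given startup files that
# assigns or exports TMPDIR. Commented-out assignments are not matched.
find_tmpdir_overrides() {
    for f in "$@"; do
        # Skip files that are missing or unreadable.
        [ -r "$f" ] || continue
        # /dev/null as a second file makes grep prefix each match with the
        # file name; "|| true" keeps a no-match result from being an error.
        grep -nE '^[[:space:]]*(export[[:space:]]+)?TMPDIR=' "$f" /dev/null || true
    done
}
```

For example, `find_tmpdir_overrides ~/.bashrc ~/.bash_profile ~/.profile` prints matches as file:line:text. Empty output only means that these particular files contain no direct assignment.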

-- Reuti

> Sean
>
>>> Date: Sat, 11 Oct 2008 07:56:02 -0400
>>> From: Jeff Squyres <jsquyres_at_[hidden]>
>>> Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for start
>>> To: Open MPI Users <users_at_[hidden]>
>>> Message-ID: <3E62159B-14B9-4D44-96F6-0345079BCCE5_at_[hidden]>
>>> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>>
>>> I don't know much/anything about SGE (I'll leave that to the Sun folks
>>> on this list to reply), but I can tell you about the tm plugins: tm is
>>> the protocol used by the PBS/Torque family of launchers. It looks like
>>> your Open MPI was built with TM support, but when you launch, it's
>>> likely unable to find the support libraries that it needs to load
>>> those plugins.
>>>
>>> This is probably fine in your case, since you want to use SGE, not TM.
>>>
>>>
>>> On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:
>>>
>>>> I am relatively new to OpenMPI and Sun Grid Engine parallel
>>>> integration. I have a small cluster that is running SGE 6.2 on Linux
>>>> machines, all using Intel Xeon processors. I have installed OpenMPI
>>>> 1.2.7 from source using the --with-sge switch. Now, I am trying to
>>>> troubleshoot some problems I am having. I have created a simple job
>>>> script:
>>>>
>>>> #!/bin/bash
>>>> #$ -S /bin/bash
>>>> #$ -cwd
>>>> mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname
>>>>
>>>> And the output on the error stream:
>>>>
>>>> > more junksub.sh.e3574
>>>>
>>>> [shakespeare:05720] mca: base: component_find: unable to open ras tm: file not found (ignored)
>>>> [shakespeare:05720] mca: base: component_find: unable to open pls tm: file not found (ignored)
>>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>>> Starting server daemon at host "octopus.nci.nih.gov"
>>>> Server daemon successfully started with task id "1.shakespeare"
>>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm: file not found (ignored)
>>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm: file not found (ignored)
>>>> error: executing task of job 3576 failed: failed sending task to execd_at_[hidden]: can't find connection
>>>> [shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed to start as expected.
>>>> [shakespeare:05720] ERROR: There may be more information available from
>>>> [shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>>> [shakespeare:05720] ERROR: If the problem persists, please restart the
>>>> [shakespeare:05720] ERROR: Grid Engine PE job
>>>> [shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.
>>>>
>>>> However, there is no output in any output stream.
>>>>
>>>> And if I log into shakespeare and run qrsh -q all.q@octopus, I
>>>> immediately get a slot, so there isn't a "direct" problem with
>>>> connecting.
>>>>
>>>> Based on a hint from folks on the SGE mailing list, it appears that
>>>> qrsh is not being used for job submission. Any suggestions as to why
>>>> this might be the case (or whether it is the case)?
>>>>
>>>> Thanks,
>>>> Sean
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users