Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for start
From: Sean Davis (sdavis2_at_[hidden])
Date: 2008-10-11 18:21:49


On Sat, Oct 11, 2008 at 5:34 PM, Pak Lui <plui_at_[hidden]> wrote:
> From your earlier discussion on the gridengine users list, it looks like
> you are able to run a simple, single-queue, tightly integrated SGE
> parallel job with Open MPI, and the problem only appears when the
> parallel job uses multiple queues, right?
>
> http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26298
>
> The tm messages are just a red herring. What's more interesting is the
> verbose output from qrsh (enabled with -mca pls_gridengine_verbose 1):
> the lines that do not start with the prefix OMPI prepends, such as
> [shakespeare:05720].
>
>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>> Starting server daemon at host "octopus.nci.nih.gov"
>>> Server daemon successfully started with task id "1.shakespeare"
>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm:
>>> file not found (ignored)
>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm:
>>> file not found (ignored)
>>> error: executing task of job 3576 failed: failed sending task to
>>> execd_at_[hidden]: can't find connection
>
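
One way to separate the qrsh lines from the Open MPI lines in the error file is to filter on the [host:pid] prefix. This is only a minimal sketch: the helper names qrsh_lines and ompi_lines are made up for illustration, and the pattern assumes every Open MPI line starts with [host:pid]. Continuation lines that a mail client has wrapped lose that prefix, so run it on the original file rather than on a pasted copy.

```shell
#!/bin/bash
# Split a job's error stream into qrsh/SGE lines and Open MPI lines.
# Assumption: Open MPI prefixes each of its lines with "[host:pid]".
# qrsh_lines and ompi_lines are hypothetical helper names.

ompi_prefix='^\[[^]:]+:[0-9]+\]'

# Print only the lines that do NOT carry the [host:pid] prefix.
qrsh_lines() {
    grep -Ev "$ompi_prefix" "$@"
}

# Print only the lines that DO carry the [host:pid] prefix.
ompi_lines() {
    grep -E "$ompi_prefix" "$@"
}
```

For example, "qrsh_lines junksub.sh.e3574" would leave only the "Starting server daemon ..." and "error: executing task ..." lines from the output in this thread.
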
> Since you see these verbose messages here, it means that you are using
> "qrsh -inherit" in the backend for launching tasks. (You can also see
> the actual "qrsh -inherit" line by setting "-mca pls_gridengine_debug 1"
> in mpirun.)
>
> Those messages show that when mpirun tries to send the SGE tasks to
> shakespeare and octopus via 2 queues, shakespeare appears to start the
> server daemon successfully, but you don't get the same message from
> octopus. Typically I see only 1 message from the server daemon when I
> use only 1 queue in my parallel job.
>
> In order for the head node's "qrsh -inherit" tasks to be accepted by SGE
> daemons on execution nodes, the execution daemons need to be
> allocated/notified ahead of time that there are impending tasks coming
> to the nodes.
>
> Anyway, I don't know why it needs to start the server daemon on octopus
> when you have 2 queues in your parallel job. But assuming that is the
> right behavior, SGE seems to have a problem starting the task from the
> head node shakespeare on octopus (hence the "failed sending task to
> execd: can't find connection" message). Did you already try
> connecting from shakespeare to octopus? You might also want to check the
> messages in octopus' log file $SGE_ROOT/default/spool/octopus/messages
> to see exactly why it isn't accepting the task.
>
> It may also be worthwhile to ask the gridengine folks whether anyone has
> tried a parallel job on multiple queues. I am not sure how commonly
> people use this SGE feature.
>
> I don't have access to an SGE cluster, but I notice from an online manual
> that there's a new qsub option (-masterq) in SGE 6.2 that may work. You
> might want to give it a try. This looks more and more like an SGE issue:
> it is not able to accept tasks from multiple queues for a parallel job.
>
> btw, you don't need the --with-sge switch in OMPI configure. It's new in
> OMPI v1.3, where SGE support is no longer built by default.
>
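
To confirm that an installation was actually built with SGE support, the component list printed by ompi_info can be searched for gridengine entries. The sketch below is an illustration only: has_gridengine is a made-up helper name, and the check runs only where ompi_info is on the PATH.

```shell
#!/bin/bash
# Hypothetical helper: read `ompi_info` output on stdin and succeed only
# if at least one line mentions a gridengine component.
has_gridengine() {
    grep -q 'gridengine'
}

# Run the check only on machines where ompi_info is installed.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_gridengine; then
        echo "gridengine components found"
    else
        echo "no gridengine components found"
    fi
fi
```

If nothing is found, the gridengine launcher cannot be selected at run time, whatever MCA parameters are passed to mpirun.
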
> My $.02...

Thanks, Pak. There is only one queue on the SGE system. Of course,
there are queue instances for each machine, which is usual for
SGE.

I'll give -masterq a look. The messages files for the involved
machines are devoid of anything useful; in fact, they don't mention
these jobs at all.
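
A quick way to check a messages file for a given job id is sketched below. The helper name job_in_messages is made up for illustration, and nothing is assumed about the layout of the file beyond the job id appearing as a separate word on the line.

```shell
#!/bin/bash
# Hypothetical helper: print every line of an execd messages file that
# mentions the given job id, with line numbers.
# Exit status: 0 = found, 1 = not mentioned, 2 = file not readable.
job_in_messages() {
    local job_id=$1 messages_file=$2
    if [ ! -r "$messages_file" ]; then
        echo "cannot read $messages_file" >&2
        return 2
    fi
    # -w matches the id as a whole word, so 3576 does not match 13576.
    grep -nw -- "$job_id" "$messages_file"
}
```

For the job in this thread that would be: job_in_messages 3576 "$SGE_ROOT/default/spool/octopus/messages"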

Sean

>> Date: Sat, 11 Oct 2008 07:56:02 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for
>> start
>> To: Open MPI Users <users_at_[hidden]>
>>
>> I don't know much/anything about SGE (I'll leave that to the Sun folks on
>> this list to reply), but I can tell you about the tm plugins: tm is the
>> protocol used by the PBS/Torque family of launchers. It looks like your
>> Open MPI was built with TM support, but when you launch, it's likely unable
>> to find the support libraries that it needs to load those plugins.
>>
>> This is probably fine in your case, since you want to use SGE, not TM.
>>
>>
>> On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:
>>
>>> I am relatively new to OpenMPI and Sun Grid Engine parallel
>>> integration. I have a small cluster that is running SGE6.2 on linux
>>> machines all using Intel Xeon processors. I have installed OpenMPI
>>> 1.2.7 from source using the --with-sge switch. Now, I am trying to
>>> troubleshoot some problems I am having. I have created a simple job
>>> script, which looks like:
>>> #!/bin/bash
>>> #$ -S /bin/bash
>>> #$ -cwd
>>> mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname
>>>
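
Since the script relies on $NSLOTS, it can help to print what SGE actually granted before mpirun starts. The sketch below assumes that SGE sets PE_HOSTFILE only for jobs running under a parallel environment and that each line of that file reads "host slots queue processor-range". show_pe_allocation is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper for the top of a job script: print the slot
# allocation SGE granted, or fail if no parallel environment is active.
# Assumed PE_HOSTFILE layout: "host slots queue processor-range" per line.
show_pe_allocation() {
    if [ -z "${PE_HOSTFILE:-}" ] || [ ! -r "$PE_HOSTFILE" ]; then
        echo "PE_HOSTFILE is unset or unreadable; was the job submitted with -pe?" >&2
        return 1
    fi
    echo "NSLOTS=${NSLOTS:-unset}"
    while read -r host slots queue _; do
        echo "$host: $slots slot(s) in $queue"
    done < "$PE_HOSTFILE"
}
```

Calling show_pe_allocation before the mpirun line would show whether octopus is part of the allocation and which queue instance it was granted from.
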
>>> And the output on the error stream:
>>>
>>> > more junksub.sh.e3574
>>>
>>> [shakespeare:05720] mca: base: component_find: unable to open ras tm:
>>> file not found (ignored)
>>> [shakespeare:05720] mca: base: component_find: unable to open pls tm:
>>> file not found (ignored)
>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>> Starting server daemon at host "octopus.nci.nih.gov"
>>> Server daemon successfully started with task id "1.shakespeare"
>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm:
>>> file not found (ignored)
>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm:
>>> file not found (ignored)
>>> error: executing task of job 3576 failed: failed sending task to
>>> execd_at_[hidden]: can't find connection
>>> [shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed
>>> to start as expected.
>>> [shakespeare:05720] ERROR: There may be more information available from
>>> [shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine
>>> tasks.
>>> [shakespeare:05720] ERROR: If the problem persists, please restart the
>>> [shakespeare:05720] ERROR: Grid Engine PE job
>>> [shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.
>>>
>>> However, there is no output in any output stream.
>>>
>>> And if I log into shakespeare and qrsh -q all.q_at_octopus, I immediately
>>> get a slot, so there isn't a "direct" problem with connecting.
>>>
>>> As I got a hint from folks on the SGE mailing list, it appears that
>>> qrsh is not being used for job submission. Any suggestions as to why
>>> this might be the case (or if it is the case)?
>>>
>>> Thanks,
>>> Sean
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users