Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi+sge
From: Reuti (reuti_at_[hidden])
Date: 2008-10-03 07:09:40


On 03.10.2008 at 10:46, Jaime Perea wrote:

> Hello again.
>
> Since I already had a 6.1 version of SGE, I reverted to it
> and included the hacks (ssh, sshd -i and qlogin_wrap). This
> way the interactive qsh and qrsh as well as batch qsub all
> work with Open MPI.
> For me this is a solution, but I'm still curious about what was
> going on in 6.2. I will see if there is a list like this one for
> SGE.

Sure there is, but we will meet again ;-)

http://gridengine.sunsource.net/maillist.html

It's the users_at_[hidden] list.

-- Reuti

>
> Thanks a lot.
>
> --
> Jaime Perea
>
> On Thursday, 2 October 2008, Rolf Vandevaart wrote:
>> On 10/02/08 11:18, Reuti wrote:
>>> On 02.10.2008 at 16:51, Jaime Perea wrote:
>>>> Hi
>>>>
>>>> builtin, do I have to change them to ssh and sshd as in sge 6.1?
>>>
>>> I always used only rsh, as ssh doesn't provide a Tight Integration
>>> with correct accounting (unless you compiled SGE with -tight-ssh on
>>> your own).
>>>
>>> But it would be worth a try with either the rsh or ssh stuff, as the
>>> builtin starter is a new feature of SGE 6.2.
>>>
>>> -- Reuti
>>
>> As was mentioned, SGE 6.2 has a new Integrated Job Starter so that rsh
>> and ssh do not need to be used to start jobs on remote nodes. This is
>> the recommended way of starting jobs, as it is faster than ssh and more
>> scalable than rsh. You also do not need any hacks for proper job
>> accounting, as was needed for ssh.
>>
>> Under the covers, Open MPI uses qrsh to start the MPI jobs on all the
>> nodes.
>>
>> Not sure if that helps, but just wanted to mention that information.
>>
>> Rolf
>>
>>>> Thanks again
>>>>
>>>> --
>>>> Jaime Perea
>>>>
>>>> On Thursday, 2 October 2008, Reuti wrote:
>>>>> On 02.10.2008 at 16:12, Jaime Perea wrote:
>>>>>> Hi again, thanks for the answer
>>>>>>
>>>>>> Actually I took the definition of the pe from the openmpi
>>>>>> webpage, in my case
>>>>>>
>>>>>> qconf -sp orte
>>>>>> pe_name orte
>>>>>> slots 24
>>>>>> user_lists NONE
>>>>>> xuser_lists NONE
>>>>>> start_proc_args /bin/true
>>>>>> stop_proc_args /bin/true
>>>>>> allocation_rule $round_robin
>>>>>> control_slaves TRUE
>>>>>> job_is_first_task TRUE
>>>>>> urgency_slots min
>>>>>> accounting_summary FALSE
>>>>>>
>>>>>> Our sge is version 6.2 and openmpi was configured with
>>>>>> the --with-sge switch of course.
>>>>>
>>>>> In SGE 6.2 two types of remote startup are implemented. Which one
>>>>> are you using (builtin or the former settings for each command) in
>>>>> the SGE configuration?
>>>>>
>>>>> -- Reuti
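
One way to see which remote startup method is configured is to look at the rsh_*, rlogin_* and qlogin_* entries of the cluster configuration printed by "qconf -sconf". The following is only a sketch: the helper names show_remote_startup and all_builtin are made up for illustration, and it assumes the settings appear one per line as "name value". A host-local configuration (qconf -sconf <host>) may override the global one, so it is worth checking both.

```shell
#!/bin/sh
# Hypothetical helper: print the remote-startup settings found in
# `qconf -sconf`-style text read from stdin.
show_remote_startup() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Hypothetical helper: succeed only if at least one remote-startup
# setting is present on stdin and every one of them is "builtin".
all_builtin() {
    awk '/^(qlogin|rlogin|rsh)_(command|daemon)[ \t]/ {
             seen = 1
             if ($2 != "builtin") bad = 1
         }
         END { if (seen && !bad) exit 0; exit 1 }'
}

# Only query the real cluster when qconf is on the PATH.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | show_remote_startup
    if qconf -sconf | all_builtin; then
        echo "all remote startup settings are builtin"
    else
        echo "at least one setting uses an external command (rsh/ssh style)"
    fi
fi
```

If all entries read "builtin", the new 6.2 starter is in use; anything else means the former per-command settings are active.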
>>>>>
>>>>>> Regards
>>>>>>
>>>>>> --
>>>>>> Jaime Perea
>>>>>>
>>>>>> On Thursday, 2 October 2008, Reuti wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 02.10.2008 at 15:37, Jaime Perea wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am having some problems with a combination of openmpi+sge6.2
>>>>>>>>
>>>>>>>> Currently I'm working with the 1.3a1r19666 openmpi release and the
>>>>>>>
>>>>>>> AFAIK, you have to enable SGE support in Open MPI 1.3 during its
>>>>>>> compilation.
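
A quick way to check whether an existing Open MPI build has SGE support is to look for the gridengine components in the output of ompi_info. The sketch below wraps that check in a small function; the name has_sge_support is only illustrative, and the exact component lines printed can differ between Open MPI versions.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if ompi_info-style text on stdin
# mentions the gridengine component.
has_sge_support() {
    grep -q 'gridengine'
}

# Only query the real installation when ompi_info is on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_sge_support; then
        echo "SGE (gridengine) support found"
    else
        echo "no gridengine component; rebuild with --with-sge"
    fi
fi
```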
>>>>>>>
>>>>>>>> myrinet gm libraries (2.1.19), but the problem was the same with
>>>>>>>> the prior 1.3 version. In short, I'm able to send jobs to a queue
>>>>>>>> via qrsh, more or less this way:
>>>>>>>>
>>>>>>>> qrsh -cwd -V -q para -pe orte 6 mpirun -np 6 ctiming
>>>>>>>
>>>>>>> It should also work without specifying the number of slots a
>>>>>>> second time, i.e.:
>>>>>>>
>>>>>>> qrsh -cwd -V -q para -pe orte 6 mpirun ctiming
>>>>>>>
>>>>>>>> ctiming is a small test program and in this way it works, but if
>>>>>>>> I try to send the same task by using qsub on a script like this one
>>>>>>>>
>>>>>>>> #!/bin/sh
>>>>>>>> #$ -pe orte 6
>>>>>>>
>>>>>>> Does this PE have just /bin/true for start_proc_args and
>>>>>>> stop_proc_args?
>>>>>>>
>>>>>>>> #$ -q para
>>>>>>>> #$ -cwd
>>>>>>>> #
>>>>>>>> mpirun -np $NSLOTS /model/jaime/ctiming
>>>>>>>
>>>>>>> mpirun /model/jaime/ctiming
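
Put together with the suggestion above, the job script would look roughly like the sketch below. The PROGRAM variable and the dry-run guard are additions for illustration and were not part of the original script. The sketch assumes that, with SGE support compiled in, mpirun takes the slot count from the granted PE, which is why -np is left out.

```shell
#!/bin/sh
#$ -pe orte 6
#$ -q para
#$ -cwd
#
# PROGRAM is an illustrative variable; the path is the one from the thread.
PROGRAM="${PROGRAM:-/model/jaime/ctiming}"

# No -np here: the slot count is assumed to come from the granted PE.
launch_cmd="mpirun $PROGRAM"
echo "slots granted by SGE: ${NSLOTS:-unknown}"
echo "launching: $launch_cmd"

# Guard so the script can be dry-run on a machine without Open MPI.
if command -v mpirun >/dev/null 2>&1 && [ -x "$PROGRAM" ]; then
    $launch_cmd
fi
```

Submitted with qsub, the "#$" lines are read as scheduler options; run by hand, they are plain comments.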
>>>>>>>
>>>>>>>> It fails with a message like this,
>>>>>>>> ..............
>>>>>>>>
>>>>>>>> error reading job context from "qlogin_starter"
>>>>>>>
>>>>>>> qlogin_starter should of course only be started with a qlogin
>>>>>>> command in SGE.
>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A daemon (pid 11207) died unexpectedly with status 1 while attempting
>>>>>>>> to launch so we are aborting.
>>>>>>>>
>>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>>
>>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>>> location of the shared libraries on the remote nodes and this
>>>>>>>> will automatically be forwarded to the remote nodes.
>>>>>>>>
>>>>>>>> .............
>>>>>>>>
>>>>>>>> I know that LD_LIBRARY_PATH is not the problem, since I checked
>>>>>>>> that the whole environment is present... any idea?
>>>>>>>>
>>>>>>>> For previous releases of SGE and Open MPI I was able to make
>>>>>>>> them work together with a few wrappers,
>>>>>>>
>>>>>>> Which version of SGE are you using?
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>> but now the integration looks much better!
>>>>>>>> This happens only when sending Open MPI jobs.
>>>>>>>>
>>>>>>>> Thanks and all the best
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Jaime D. Perea Duarte. <jaime at iaa dot es>
>>>>>>>> Linux registered user #10472
>>>>>>>>
>>>>>>>> Dep. Astrofisica Extragalactica.
>>>>>>>> Instituto de Astrofisica de Andalucia (CSIC)
>>>>>>>> Apdo. 3004, 18080 Granada, Spain.
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users