Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi+sge
From: Jaime Perea (jaime.perea_at_[hidden])
Date: 2008-10-03 04:46:46


Hello again.

Since I already had SGE 6.1 installed, I reverted to it and applied
the usual hacks (ssh, sshd -i and qlogin_wrap). With that setup, both
the interactive commands (qsh and qrsh) and batch submission with qsub
work with Open MPI.
For me this is a solution, but I'm still curious about what was
going on in 6.2. I will see whether a list like this one exists for
SGE.

Thanks a lot.
 

--
Jaime Perea
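
For anyone comparing the two setups, here is a minimal sketch for checking which remote startup method a cluster is configured with (builtin versus explicit rsh/ssh commands), which is the question raised in the quoted thread. It assumes the six remote-startup parameter names that `qconf -sconf` reports in the SGE configuration; the helper name `startup_settings` and the sample values are purely illustrative and do not come from the poster's cluster.

```shell
#!/bin/sh
# Print only the remote-startup parameters from SGE configuration text on stdin.
# startup_settings is a hypothetical helper name, not part of SGE.
startup_settings() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# On a real cluster (assumes qconf is in PATH):
#   qconf -sconf | startup_settings

# Self-contained demonstration with made-up sample values:
sample='execd_spool_dir   /var/spool/sge
rsh_command       builtin
rsh_daemon        builtin
qlogin_command    builtin'
printf '%s\n' "$sample" | startup_settings
```

If every listed parameter reads `builtin`, the 6.2 integrated starter is in use; paths to ssh/sshd or rsh binaries would indicate the older per-command settings.
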
On Thursday, 2 October 2008, Rolf Vandevaart wrote:
> On 10/02/08 11:18, Reuti wrote:
> > On 02.10.2008 at 16:51, Jaime Perea wrote:
> >> Hi
> >>
> >> builtin. Do I have to change them to ssh and sshd as in SGE 6.1?
> >
> > I have always used only rsh, as ssh doesn't provide a Tight Integration
> > with correct accounting (unless you compiled SGE with -tight-ssh on
> > your own).
> >
> > But it would be worth a try with either the rsh or ssh stuff, as the
> > builtin starter is a new feature of SGE 6.2.
> >
> > -- Reuti
>
> As was mentioned, SGE 6.2 has a new Integrated Job Starter so that rsh
> and ssh do not need to be used to start jobs on remote nodes.  This is
> the recommended way of starting as it is faster than ssh and more
> scalable than rsh.  And, you do not need to do any hacks for proper job
> accounting, as was needed for ssh.
>
> Under the covers, Open MPI uses qrsh to start the MPI jobs on all the
> nodes.
>
> Not sure if that helps, but just wanted to mention that information.
>
> Rolf
>
> >> Thanks again
> >>
> >> --
> >> Jaime Perea
> >>
> >> On Thursday, 2 October 2008, Reuti wrote:
> >>> On 02.10.2008 at 16:12, Jaime Perea wrote:
> >>>> Hi again, thanks for the answer
> >>>>
> >>>> Actually I took the definition of the pe from the openmpi
> >>>> webpage, in my case
> >>>>
> >>>> qconf -sp orte
> >>>> pe_name            orte
> >>>> slots              24
> >>>> user_lists         NONE
> >>>> xuser_lists        NONE
> >>>> start_proc_args    /bin/true
> >>>> stop_proc_args     /bin/true
> >>>> allocation_rule    $round_robin
> >>>> control_slaves     TRUE
> >>>> job_is_first_task  TRUE
> >>>> urgency_slots      min
> >>>> accounting_summary FALSE
> >>>>
> >>>> Our sge is version 6.2 and openmpi was configured with
> >>>> the --with-sge switch of course.
> >>>
> >>> In SGE 6.2 two types of remote startup are implemented. Which one
> >>> are you using (builtin or the former settings for each command) in
> >>> the SGE configuration?
> >>>
> >>> -- Reuti
> >>>
> >>>> Regards
> >>>>
> >>>> --
> >>>> Jaime Perea
> >>>>
> >>>> On Thursday, 2 October 2008, Reuti wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 02.10.2008 at 15:37, Jaime Perea wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I am having some problems with a combination of openmpi+sge6.2
> >>>>>>
> >>>>>> Currently I'm working with the 1.3a1r19666 openmpi release and the
> >>>>>
> >>>>> AFAIK, you have to enable SGE support in Open MPI 1.3 during its
> >>>>> compilation.
> >>>>>
> >>>>>> myrinet gm libraries (2.1.19), but the problem was the same with
> >>>>>> the prior 1.3 version. In short, I'm able to send jobs to a queue
> >>>>>> via qrsh, more or less this way:
> >>>>>>
> >>>>>> qrsh -cwd -V -q para -pe orte 6 mpirun -np 6 ctiming
> >>>>>
> >>>>> It should also work without specifying the number of slots a
> >>>>> second time, i.e.:
> >>>>>
> >>>>> qrsh -cwd -V -q para -pe orte 6 mpirun ctiming
> >>>>>
> >>>>>> ctiming is a small test program and in this way it works, but if
> >>>>>> I try to send the same task by using qsub on a script like this one
> >>>>>>
> >>>>>> #!/bin/sh
> >>>>>> #$ -pe orte 6
> >>>>>
> >>>>> This PE has just /bin/true for start-/stop_proc_args?
> >>>>>
> >>>>>> #$ -q para
> >>>>>> #$ -cwd
> >>>>>> #
> >>>>>> mpirun -np $NSLOTS  /model/jaime/ctiming
> >>>>>
> >>>>> mpirun /model/jaime/ctiming
> >>>>>
> >>>>>> It fails with a message like this,
> >>>>>> ..............
> >>>>>>
> >>>>>> error reading job context from "qlogin_starter"
> >>>>>
> >>>>> qlogin_starter should of course only be started with a qlogin
> >>>>> command in SGE.
> >>>>>
> >>>>>> --------------------------------------------------------------------------
> >>>>>> A daemon (pid 11207) died unexpectedly with status 1 while attempting
> >>>>>> to launch so we are aborting.
> >>>>>>
> >>>>>> There may be more information reported by the environment (see above).
> >>>>>>
> >>>>>> This may be because the daemon was unable to find all the needed shared
> >>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>>>>> location of the shared libraries on the remote nodes and this will
> >>>>>> automatically be forwarded to the remote nodes.
> >>>>>>
> >>>>>> .............
> >>>>>>
> >>>>>> I know that LD_LIBRARY_PATH is not the problem, since I checked
> >>>>>> that the whole environment is present. Any idea?
> >>>>>>
> >>>>>> For previous releases of the sge and openmpi I was able to make
> >>>>>> them work together with a few wrappers,
> >>>>>
> >>>>> Which version of SGE are you using?
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>> but now the integration looks much better!
> >>>>>> This happens only when sending openmpi jobs.
> >>>>>>
> >>>>>> Thanks and all the best
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>>            Jaime D. Perea Duarte. <jaime at iaa dot es>
> >>>>>>              Linux registered user #10472
> >>>>>>
> >>>>>>            Dep. Astrofisica Extragalactica.
> >>>>>>            Instituto de Astrofisica de Andalucia (CSIC)
> >>>>>>            Apdo. 3004, 18080 Granada, Spain.
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> users_at_[hidden]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
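
Since the thread notes that SGE support has to be enabled when Open MPI 1.3 is built (the `--with-sge` switch), a quick pre-flight check can rule that out before debugging the startup method. The sketch below assumes that an SGE-enabled build lists a `gridengine` component in its `ompi_info` output; the helper name `has_gridengine` and the sample line are illustrative only, not output from the poster's installation.

```shell
#!/bin/sh
# Succeed if ompi_info-style text on stdin mentions a gridengine component.
# has_gridengine is a hypothetical helper name, not part of Open MPI.
has_gridengine() {
    grep -q 'gridengine'
}

# On a real installation (assumes ompi_info is in PATH):
#   ompi_info | has_gridengine && echo "SGE support present"

# Self-contained demonstration with an illustrative line of output:
if printf '%s\n' '   MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)' | has_gridengine
then
    echo "SGE support present"
else
    echo "SGE support missing"
fi
```

A passing check only shows that the component was built; it says nothing about whether the SGE side (PE definition, remote startup settings) is configured correctly.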