Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trouble with SGE integration
From: Reuti (reuti_at_[hidden])
Date: 2009-11-30 17:48:00


On 30.11.2009, at 20:07, Ondrej Glembek wrote:

> I definitely compiled the package with the --with-sge flag... Here's
> my configure line:
>
> ./configure --prefix=/homes/kazi/glembek/share/openmpi-1.3.3-64 \
>     --with-sge --enable-shared --enable-static \
>     --host=x86_64-linux --build=x86_64-linux NM=x86_64-linux-nm

Is there any list of valid values for --host, --build and NM - and
what is NM for? From ./configure --help I would assume that these
options are meant for cross-compiling, e.g. telling Open MPI that the
build happens on a PPC platform even though I issue the command on an
x86 machine, and that the resulting binaries should run on x86_64.
Maybe you can leave them out, as build and host are the same in your
case?
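
For a native build like yours (build and host are both x86_64-linux),
a stripped-down configure line would presumably be enough - just a
sketch based on the flags you posted, not verified here:

  ./configure --prefix=/homes/kazi/glembek/share/openmpi-1.3.3-64 \
      --with-sge --enable-shared --enable-static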

> Just to mention one more interesting thing: when - by luck - SGE
> schedules the job onto a single machine (i.e. an SMP-style
> allocation), everything works with no problem...

Then it will just fork the processes locally - no need to use qrsh at all.

> Is there any way to force the ssh before the (...) term???

Using SSH directly would bypass SGE's startup. What are your entries
for qrsh_daemon and so on in SGE's configuration? Which version of SGE?
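
To check both, something like this should work on a stock SGE
installation (the exact entry names can differ between SGE releases,
so treat it as a sketch):

  # remote-startup entries of the global SGE configuration
  qconf -sconf | egrep 'rsh_command|rsh_daemon|qlogin_command|qlogin_daemon'
  # the first line of the help output shows the SGE version
  qconf -help | head -1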

But I think the real problem is that Open MPI assumes you are
outside of SGE and therefore uses a different startup. Are you
resetting any of SGE's environment variables (like $JOB_ID) in your
custom starter method?
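
A quick way to verify that would be to log the environment inside the
starter method, right before the exec - a minimal sketch (the log file
name is just an example):

  # dump the SGE-related variables the starter actually sees
  env | egrep '^(JOB_ID|SGE_|PE_HOSTFILE|NSLOTS|NHOSTS)' >> /tmp/starter-debug.$$
  echo "exec args: $@" >> /tmp/starter-debug.$$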

-- Reuti

>
> Thanx
> Ondrej
>
>
> Reuti wrote:
>> On 30.11.2009, at 18:46, Ondrej Glembek wrote:
>>> Hi, thanx for reply...
>>>
>>> I tried to dump the $@ before calling the exec and here it is:
>>>
>>>
>>> ( test ! -r ./.profile || . ./.profile;
>>>   PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export PATH ;
>>>   LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>   /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>     -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>     --hnp-uri "3870359552.0;tcp://147.229.8.134:53727"
>>>     --mca pls_gridengine_verbose 1 --output-filename mpi.log )
>>>
>>>
>>> It looks like the line gets constructed in
>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>
>>> Still I wonder why mpiexec calls the starter.sh... I thought the
>>> starter was supposed to call the script which wraps the call to
>>> mpiexec...
>> Correct. This will happen for the master node of this job, i.e.
>> where the jobscript is executed. But it will also be used for the
>> qrsh -inherit calls. I wonder about one thing: I see only a call
>> to "orted" and not the above sub-shell on my machines. Did you
>> compile Open MPI with --with-sge?
>> The original call above would be "ssh node_xy ( test ! ....)",
>> which seems to work for ssh and rsh.
>> Just one note: with the starter script you will lose the PATH and
>> LD_LIBRARY_PATH that were set, as a new shell is created. It might
>> be necessary to set them again in your starter method.
>> -- Reuti
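
If the lost PATH/LD_LIBRARY_PATH turns out to be the culprit, the
starter method could simply re-export them before the exec - a rough
sketch using the prefix from the configure line above, untested:

  # restore the Open MPI paths that get lost in the fresh starter shell
  PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH; export PATH
  LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH

  exec "$@"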
>>>
>>> Am I not right???
>>> Ondrej
>>>
>>>
>>> Reuti wrote:
>>>> Hi,
>>>> On 30.11.2009, at 16:33, Ondrej Glembek wrote:
>>>>> we are using a custom starter method in our SGE to launch our
>>>>> jobs... It
>>>>> looks something like this:
>>>>>
>>>>> #!/bin/sh
>>>>>
>>>>> # ... we do a whole bunch of stuff here
>>>>>
>>>>> # start the job in this shell
>>>>> exec "$@"
>>>> the "$@" should expand to the path to the jobscript (qsub) or
>>>> the command (qrsh) plus the given options.
>>>> For the tasks spread to other nodes I get as argument: "orted
>>>> -mca ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>> So I wonder where the . ./.profile is coming from. Can you put
>>>> a `sleep 60` or the like before the `exec ...` and grep the
>>>> constructed line from `ps -e f` before it crashes?
>>>> -- Reuti
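
Spelled out, that debugging suggestion would look roughly like this in
the starter method (only temporarily, for debugging):

  # hold the starter for a minute so the constructed command line
  # can be inspected from another shell on that node
  sleep 60
  exec "$@"

and, while it sleeps, on the execution node:

  ps -e f | grep orted
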
>>>>> The trouble is that mpiexec passes a command which looks like
>>>>> this:
>>>>>
>>>>> ( . ./.profile ..... )
>>>>>
>>>>> which, however, is not a valid exec argument...
>>>>>
>>>>> Is there any way to tell mpiexec to run it in a separate
>>>>> script??? Any
>>>>> idea how to solve this???
>>>>>
>>>>> Thanx
>>>>> Ondrej Glembek
>>>>>
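
A common workaround for this exec limitation is to hand the whole
argument string to a shell instead of exec'ing it directly - a
minimal, untested sketch of such a starter method ("$*" flattens the
original argument quoting, which is usually acceptable here):

  #!/bin/sh
  # ... site-specific setup ...
  # run the arguments through a shell so that compound commands like
  # "( . ./.profile ... ; orted ... )" are accepted
  exec /bin/sh -c "$*"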
>>>
>
> --
>
> Ondrej Glembek, PhD student    E-mail: glembek_at_[hidden]
> UPGM FIT VUT Brno, L226        Web:    http://www.fit.vutbr.cz/~glembek
> Bozetechova 2, 612 66          Phone:  +420 54114-1292
> Brno, Czech Republic           Fax:    +420 54114-1290
>
> ICQ: 93233896
> GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users