Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trouble with SGE integration
From: Ondrej Glembek (glembek_at_[hidden])
Date: 2009-12-01 04:00:36


Hi

Reuti wrote:
>>
>> ./configure --prefix=/homes/kazi/glembek/share/openmpi-1.3.3-64
>> --with-sge --enable-shared --enable-static --host=x86_64-linux
>> --build=x86_64-linux NM=x86_64-linux-nm
>
> Is there any list of valid values for --host, --build and NM - and what
> is NM for? From the ./configure --help I would "assume" that one can
> tell Open MPI to prepare to BUILD on a PPC platform, although I'm
> issuing the command on an x86, and the result of the PPC compile should
> be to run on x86_64. Maybe you can leave it out, as it's the same in
> your case?

This is not the problem... We have both 32-bit and 64-bit machines, and
the problem occurs on both (i.e. even when omitting --host, --build, etc.)...

>
>> Is there any way to force the ssh before the (...) term???
>
> Using SSH directly would bypass SGE's startup. What are your entries for
> qrsh_daemon and so on in SGE's configuration? Which version of SGE?

qstat reports the version as "GE 6.2u4"... The qconf -sconf dump is below.

>
> But I think the real problem is, that Open MPI assumes you are outside
> of SGE and so uses a different startup. Are you resetting any of SGE's
> environment variables in your custom starter method (like $JOB_ID)?
I don't think Open MPI is unaware of SGE when it calls the
starter.sh...
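
For what it's worth, if I read plm_rsh_module.c correctly, the rsh
launcher decides whether it is inside SGE by checking a few environment
variables, so one thing to verify from inside the job is that none of
them gets lost (just a debugging idea; the variable list is my reading
of the 1.3.3 source):

$$$
# variables the rsh/qrsh launcher checks before using qrsh -inherit;
# if any of them is missing inside the job, it falls back to ssh/rsh
env | grep -E '^(SGE_ROOT|ARC|PE_HOSTFILE|JOB_ID)='

# more verbosity from the launcher selection itself:
mpirun --mca plm_base_verbose 10 -np 2 hostname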

The starter.sh looks like this:

$$$
#!/bin/sh

ulimit -S -c 0          # disable core dumps
ulimit -S -t unlimited  # no CPU time limit

#echo "$@" >>/pub/tmp/starter.log

# start the job in this shell
exec "$@"

so no resetting of any kind. Also the ompi_info output looks ok:

$$$
ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
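
If the compound "( test ! -r ./.profile || ... )" line really is what
reaches starter.sh, one workaround I might try is to detect that case in
the starter and hand it to a shell instead of exec'ing it directly.
Untested sketch:

$$$
#!/bin/sh
ulimit -S -c 0
ulimit -S -t unlimited

case "$1" in
  \(*)  # compound command as built by plm_rsh_module.c: needs a shell
        exec /bin/sh -c "$*"
        ;;
  *)    # normal case: plain executable plus arguments
        exec "$@"
        ;;
esac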

$$$
qconf -sconf
#global:
execd_spool_dir /usr/local/share/SGE/default/spool
mailer /bin/mail
xterm /usr/bin/xterm
load_sensor /usr/local/share/SGE/util/disk.sh
prolog none
epilog none
shell_start_mode posix_compliant
login_shells sh,ksh,csh,tcsh,bash
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:30
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail linux_at_[hidden]
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
reporting_params accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false \
                             sharelog=00:00:00
finished_jobs 20
gid_range 20000-20100
qlogin_command builtin
qlogin_daemon builtin
rlogin_daemon builtin
max_aj_instances 2000
max_aj_tasks 90000
max_u_jobs 0
max_jobs 0
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project STD
auto_user_delete_time 0
delegated_file_staging false
rsh_daemon builtin
rsh_command builtin
rlogin_command builtin
reprioritize 0
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
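
One thing the dump above doesn't show is the parallel environment the
job runs under. If I understand the tight-integration docs right, the PE
needs control_slaves TRUE for the qrsh -inherit calls to be allowed
("ompi" below is just a placeholder for the actual PE name):

$$$
qconf -sp ompi
# the relevant lines in the output should read:
#   control_slaves     TRUE
#   job_is_first_task  FALSE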

Thanx

>
> -- Reuti
>
>
>>
>> Thanx
>> Ondrej
>>
>>
>> Reuti wrote:
>>> On 30.11.2009 at 18:46, Ondrej Glembek wrote:
>>>> Hi, thanx for the reply...
>>>>
>>>> I tried to dump $@ before calling exec, and here it is:
>>>>
>>>>
>>>> ( test ! -r ./.profile || . ./.profile;
>>>> PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export
>>>> PATH ;
>>>> LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH
>>>> ; export LD_LIBRARY_PATH ;
>>>> /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>> -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca
>>>> orte_ess_num_procs 2 --hnp-uri
>>>> "3870359552.0;tcp://147.229.8.134:53727" --mca
>>>> pls_gridengine_verbose 1 --output-filename mpi.log )
>>>>
>>>>
>>>> It looks like the line gets constructed in
>>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>>
>>>> Still I wonder why mpiexec calls the starter.sh... I thought the
>>>> starter was supposed to call the script which wraps the call to
>>>> mpiexec...
>>> Correct. This will happen for the master node of this job, i.e. where
>>> the jobscript is executed. But it will also be used for the qrsh
>>> -inherit calls. I wonder about one thing: I see only a call to
>>> "orted" and not the above sub-shell on my machines. Did you compile
>>> Open MPI with --with-sge?
>>> The original call above would be "ssh node_xy ( test ! ....)", which
>>> seems to work for ssh and rsh.
>>> Just one note: with the starter script you will lose the PATH and
>>> LD_LIBRARY_PATH settings, as a new shell is created. It might be
>>> necessary to set them again in your starter method.
>>> -- Reuti
>>>>
>>>> Am I not right???
>>>> Ondrej
>>>>
>>>>
>>>> Reuti wrote:
>>>>> Hi,
>>>>> On 30.11.2009 at 16:33, Ondrej Glembek wrote:
>>>>>> we are using a custom starter method in our SGE setup to launch
>>>>>> our jobs... It looks something like this:
>>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> # ... we do a whole bunch of stuff here
>>>>>>
>>>>>> # start the job in this shell
>>>>>> exec "$@"
>>>>> the "$@" should be replaced by the path to the jobscript (qsub) or
>>>>> command (qrsh) plus the given options.
>>>>> For the tasks spread to other nodes I get as argument: " orted -mca
>>>>> ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>>> So I wonder where the . ./.profile is coming from. Can you put a
>>>>> `sleep 60` or similar before the `exec ...` and grep the built line
>>>>> from `ps -e f` before it crashes?
>>>>> -- Reuti
>>>>>> The trouble is that mpiexec passes a command which looks like this:
>>>>>>
>>>>>> ( . ./.profile ..... )
>>>>>>
>>>>>> which, however, is not a valid exec argument...
>>>>>>
>>>>>> Is there any way to tell mpiexec to run it in a separate script???
>>>>>> Any
>>>>>> idea how to solve this???
>>>>>>
>>>>>> Thanx
>>>>>> Ondrej Glembek
>>>>>>
>>>>
>>
>

-- 
  Ondrej Glembek, PhD student  E-mail: glembek_at_[hidden]
  UPGM FIT VUT Brno, L226      Web:    http://www.fit.vutbr.cz/~glembek
  Bozetechova 2, 612 66        Phone:  +420 54114-1292
  Brno, Czech Republic         Fax:    +420 54114-1290
  ICQ: 93233896
  GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C