Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trouble with SGE integration
From: Ondrej Glembek (glembek_at_[hidden])
Date: 2009-12-01 08:57:17


Hi,
We have solved the problem by rewriting starter.sh.

The script is unchanged except for the very last part, where the
command is executed. Instead of a plain exec "$@", we now use:

==========
# shopt is a bash builtin, so this needs bash.
# With execfail set, a failed exec does not terminate the script.
shopt -s execfail

# Start the job in this shell.
exec "$@"

# If the job is not a shell script but a shell command line, evaluate it instead.
eval "$@"
==========
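
For anyone who wants to see the fallback in isolation, here is a minimal sketch. It is not our real starter.sh: the temporary file and the sample commands are made up for the demo, and it assumes bash and mktemp are available.

```shell
# Hypothetical stand-in for the last lines of starter.sh, written to a temp file.
starter=$(mktemp)
cat > "$starter" <<'EOF'
#!/bin/bash
shopt -s execfail   # a failed exec no longer terminates the script
exec "$@"           # normal case: job script or binary
eval "$@"           # reached only when exec failed
EOF

# A real command is simply exec'ed.
plain=$(bash "$starter" echo hello 2>/dev/null)

# A command line whose first word is "(" cannot be exec'ed, so eval runs it.
compound=$(bash "$starter" '(' cd / ';' pwd ')' 2>/dev/null)

rm -f "$starter"
echo "plain:    $plain"
echo "compound: $compound"
```

The starter is run as a separate bash process here, which is closer to how the real starter gets invoked than calling it as a function. Redirecting stderr only hides exec's "not found" message in the demo; the message is still produced.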

The "exec: (: not found" message still appears in the log file, because
the first exec attempt still fails before eval takes over, but otherwise
everything seems to be OK.
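
If the leftover message is a nuisance, one possible variant (untested on our cluster, so treat it as a sketch) is to decide up front whether the first word is something exec can run. It assumes bash, uses the bash builtin `type -P` to look the word up, and assumes job scripts are executable files; the temporary file and sample commands are made up for the demo.

```shell
# Hypothetical variant of the last lines of starter.sh, written to a temp file.
starter=$(mktemp)
cat > "$starter" <<'EOF'
#!/bin/bash
# type -P prints a path only if the first word is an executable file.
if [ -n "$(type -P "$1" 2>/dev/null)" ]; then
    exec "$@"       # job script or binary
fi
eval "$@"           # anything else, e.g. a "( ... )" command line
EOF

# stderr is captured too, so any leftover error text would show up here.
plain=$(bash "$starter" echo hello 2>&1)
compound=$(bash "$starter" '(' cd / ';' pwd ')' 2>&1)

rm -f "$starter"
echo "plain:    $plain"
echo "compound: $compound"
```

The trade-off is that the decision is made by a PATH lookup instead of by exec itself, so a job script that is not executable would end up in eval rather than failing in exec.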

Thanx
Ondrej

Reuti wrote:
> On 01.12.2009 at 10:32, Ondrej Glembek wrote:
>
>> Just to add more info:
>>
>> Reuti wrote:
>>> On 30.11.2009 at 20:07, Ondrej Glembek wrote:
>>>
>>> But I think the real problem is that Open MPI assumes you are outside
>>> of SGE and so uses a different startup. Are you resetting any of SGE's
>>> environment variables in your custom starter method (like $JOB_ID)?
>>
>> Also, one of the reasons that makes me think Open MPI knows it is
>> inside SGE is the dump of mpiexec (below).
>>
>> The first four lines show that starter.sh is called from mpiexec and has
>> trouble with the (...) command.
>>
>> The last four lines show that mpiexec knows the machines it is supposed
>> to run on.
>>
>> Thanx
>>
>>
>>
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>
> You are right. So the question remains: why is Open MPI building such a
> line at all?
>
> As you found the place in the source, it's done only for certain shells,
> and I would assume only in the case of an rsh/ssh startup. If you put a
> `sleep 60` in your starter script, it will first delay the start of the
> program itself; then, when mpiexec launches the remote daemons, you
> should see some "qrsh -inherit ..." processes on the master node of the
> parallel job. Are these present?
>
> -- Reuti
>
>
>> --------------------------------------------------------------------------
>>
>> A daemon (pid 30616) died unexpectedly with status 127 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>> the location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>>
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>>
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>>
>> blade57.fit.vutbr.cz - daemon did not report back when launched
>> blade39.fit.vutbr.cz - daemon did not report back when launched
>> blade41.fit.vutbr.cz - daemon did not report back when launched
>> blade61.fit.vutbr.cz - daemon did not report back when launched
>>
>>
>>
>>
>>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> Thanx
>>>> Ondrej
>>>>
>>>>
>>>> Reuti wrote:
>>>>> On 30.11.2009 at 18:46, Ondrej Glembek wrote:
>>>>>> Hi, thanks for the reply...
>>>>>>
>>>>>> I tried to dump the $@ before calling the exec and here it is:
>>>>>>
>>>>>>
>>>>>> ( test ! -r ./.profile || . ./.profile;
>>>>>> PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export PATH ;
>>>>>> LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>>> /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>>>> -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1
>>>>>> -mca orte_ess_num_procs 2 --hnp-uri "3870359552.0;tcp://147.229.8.134:53727"
>>>>>> --mca pls_gridengine_verbose 1 --output-filename mpi.log )
>>>>>>
>>>>>>
>>>>>> It looks like the line gets constructed in
>>>>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>>>>
>>>>>> Still I wonder, why mpiexec calls the starter.sh... I thought the
>>>>>> starter was supposed to call the script which wraps a call to
>>>>>> mpiexec...
>>>>> Correct. This will happen for the master node of this job, i.e. where
>>>>> the jobscript is executed. But it will also be used for the qrsh
>>>>> -inherit calls. I wonder about one thing: I see only a call to
>>>>> "orted" and not the above sub-shell on my machines. Did you compile
>>>>> Open MPI with --with-sge?
>>>>> The original call above would be "ssh node_xy ( test ! ....)" which
>>>>> seems to work for ssh and rsh.
>>>>> Just one note: with the starter script you will lose the set PATH and
>>>>> LD_LIBRARY_PATH, as a new shell is created. It might be necessary to
>>>>> set it again in your starter method.
>>>>> -- Reuti
>>>>>>
>>>>>> Am I not right???
>>>>>> Ondrej
>>>>>>
>>>>>>
>>>>>> Reuti wrote:
>>>>>>> Hi,
>>>>>>> On 30.11.2009 at 16:33, Ondrej Glembek wrote:
>>>>>>>> we are using a custom starter method in our SGE to launch our
>>>>>>>> jobs... It looks something like this:
>>>>>>>>
>>>>>>>> #!/bin/sh
>>>>>>>>
>>>>>>>> # ... we do whole bunch of stuff here
>>>>>>>>
>>>>>>>> #start the job in this shell
>>>>>>>> exec "$@"
>>>>>>> the "$@" should be replaced by the path to the jobscript (qsub) or
>>>>>>> command (qrsh) plus the given options.
>>>>>>> For the spread tasks to other nodes I get as argument: " orted -mca
>>>>>>> ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>>>>> So I wonder where the . ./.profile is coming from. Can you put a
>>>>>>> `sleep 60` or similar before the `exec ...` and grep the built line
>>>>>>> from `ps -e f` before it crashes?
>>>>>>> -- Reuti
>>>>>>>> The trouble is that mpiexec passes a command which looks like this:
>>>>>>>>
>>>>>>>> ( . ./.profile ..... )
>>>>>>>>
>>>>>>>> which, however, is not a valid exec argument...
>>>>>>>>
>>>>>>>> Is there any way to tell mpiexec to run it in a separate script?
>>>>>>>> Any idea how to solve this?
>>>>>>>>
>>>>>>>> Thanx
>>>>>>>> Ondrej Glembek
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Ondrej Glembek, PhD student  E-mail: glembek_at_[hidden]
>>>>>>>> UPGM FIT VUT Brno, L226      Web:    http://www.fit.vutbr.cz/~glembek
>>>>>>>> Bozetechova 2, 612 66        Phone:  +420 54114-1292
>>>>>>>> Brno, Czech Republic         Fax:    +420 54114-1290
>>>>>>>>
>>>>>>>> ICQ: 93233896
>>>>>>>> GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
  Ondrej Glembek, PhD student  E-mail: glembek_at_[hidden]
  UPGM FIT VUT Brno, L226      Web:    http://www.fit.vutbr.cz/~glembek
  Bozetechova 2, 612 66        Phone:  +420 54114-1292
  Brno, Czech Republic         Fax:    +420 54114-1290
  ICQ: 93233896
  GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C