
Subject: Re: [OMPI users] Trouble with SGE integration
From: Ondrej Glembek (glembek_at_[hidden])
Date: 2009-12-01 04:32:43


Just to add more info:

Reuti wrote:
> On 30.11.2009 at 20:07, Ondrej Glembek wrote:
>
> But I think the real problem is that Open MPI assumes you are outside
> of SGE and so uses a different startup. Are you resetting any of SGE's
> environment variables in your custom starter method (like $JOB_ID)?

Another reason that makes me think Open MPI knows it is inside SGE is
the mpiexec dump below (a possible workaround sketch follows the dump).

The first four lines show that starter.sh is called from mpiexec and
chokes on the ( ... ) command...

The last four lines show that mpiexec knows the machines it is supposed
to run on...

Thanx

/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
/usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
--------------------------------------------------------------------------
A daemon (pid 30616) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        blade57.fit.vutbr.cz - daemon did not report back when launched
        blade39.fit.vutbr.cz - daemon did not report back when launched
        blade41.fit.vutbr.cz - daemon did not report back when launched
        blade61.fit.vutbr.cz - daemon did not report back when launched
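
For the record, one workaround I can think of (untested; a minimal
sketch, assuming the command reaches starter.sh either as separate
words or as one big string, exactly as in the dump above) is to detect
the sub-shell form and feed it back to a shell instead of exec-ing it
directly:

#!/bin/sh

# ... whole bunch of site-specific stuff as before ...

# exec "$@" fails when the first word is "(", because exec expects a
# program name, not shell syntax. Re-joining the words and handing
# them to /bin/sh makes the sub-shell command valid again
# (hypothetical workaround, not from the thread):
case "$1" in
  "("*) exec /bin/sh -c "$*" ;;
  *)    exec "$@" ;;
esac

Either way "$*" reconstructs the command line, so sh -c should accept
it whether the ( ... ) arrives as one argument or as many.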

>
> -- Reuti
>
>
>>
>> Thanx
>> Ondrej
>>
>>
>> Reuti wrote:
>>> On 30.11.2009 at 18:46, Ondrej Glembek wrote:
>>>> Hi, thanx for the reply...
>>>>
>>>> I tried to dump the $@ before calling the exec and here it is:
>>>>
>>>>
>>>> ( test ! -r ./.profile || . ./.profile;
>>>> PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export
>>>> PATH ;
>>>> LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH
>>>> ; export LD_LIBRARY_PATH ;
>>>> /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>> -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca
>>>> orte_ess_num_procs 2 --hnp-uri
>>>> "3870359552.0;tcp://147.229.8.134:53727" --mca
>>>> pls_gridengine_verbose 1 --output-filename mpi.log )
>>>>
>>>>
>>>> It looks like the line gets constructed in
>>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>>
>>>> Still I wonder why mpiexec calls the starter.sh... I thought the
>>>> starter was supposed to call the script which wraps a call to
>>>> mpiexec...
>>> Correct. This will happen for the master node of this job, i.e. where
>>> the jobscript is executed. But it will also be used for the qrsh
>>> -inherit calls. I wonder about one thing: I see only a call to
>>> "orted" and not the above sub-shell on my machines. Did you compile
>>> Open MPI with --with-sge?
>>> The original call above would be "ssh node_xy ( test ! ....)", which
>>> seems to work for ssh and rsh.
>>> Just one note: with the starter script you will lose the PATH and
>>> LD_LIBRARY_PATH that were set, as a new shell is created. It may be
>>> necessary to set them again in your starter method.
>>> -- Reuti
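
Re Reuti's note on losing PATH and LD_LIBRARY_PATH: if it comes to
that, something along these lines at the top of starter.sh should
restore them (a sketch using our local install prefix from the dump
above; adjust as needed):

# hypothetical: re-export the Open MPI paths that get lost when SGE
# spawns a fresh shell for the starter method
OMPI_PREFIX=/homes/kazi/glembek/share/openmpi-1.3.3-64
PATH=$OMPI_PREFIX/bin:$PATH; export PATH
LD_LIBRARY_PATH=$OMPI_PREFIX/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH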
>>>>
>>>> Am I not right?
>>>> Ondrej
>>>>
>>>>
>>>> Reuti wrote:
>>>>> Hi,
>>>>> On 30.11.2009 at 16:33, Ondrej Glembek wrote:
>>>>>> we are using a custom starter method in our SGE to launch our
>>>>>> jobs... It
>>>>>> looks something like this:
>>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> # ... we do a whole bunch of stuff here
>>>>>>
>>>>>> # start the job in this shell
>>>>>> exec "$@"
>>>>> the "$@" should be replaced by the path to the jobscript (qsub) or
>>>>> command (qrsh) plus the given options.
>>>>> For the spread tasks to other nodes I get as argument: " orted -mca
>>>>> ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>>> So I wonder, where the . ./.profile is coming from. Can you put a
>>>>> `sleep 60` or alike before the `exec ...` and grep the built line
>>>>> from `ps -e f` before it crashes?
>>>>> -- Reuti
>>>>>> The trouble is that mpiexec passes a command which looks like this:
>>>>>>
>>>>>> ( . ./.profile ..... )
>>>>>>
>>>>>> which, however, is not a valid exec argument...
>>>>>>
>>>>>> Is there any way to tell mpiexec to run it in a separate script?
>>>>>> Any idea how to solve this?
>>>>>>
>>>>>> Thanx
>>>>>> Ondrej Glembek
>>>>>>

-- 
  Ondrej Glembek, PhD student  E-mail: glembek_at_[hidden]
  UPGM FIT VUT Brno, L226      Web:    http://www.fit.vutbr.cz/~glembek
  Bozetechova 2, 612 66        Phone:  +420 54114-1292
  Brno, Czech Republic         Fax:    +420 54114-1290
  ICQ: 93233896
  GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C