
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Trouble with SGE integration
From: Reuti (reuti_at_[hidden])
Date: 2009-12-01 05:50:53


On 01.12.2009 at 10:32, Ondrej Glembek wrote:

> Just to add more info:
>
> Reuti wrote:
>> On 30.11.2009 at 20:07, Ondrej Glembek wrote:
>>
>> But I think the real problem is that Open MPI assumes you are outside
>> of SGE and so uses a different startup. Are you resetting any of SGE's
>> environment variables in your custom starter method (like $JOB_ID)?
>
> Also, one of the reasons that make me think Open MPI knows it is
> inside of SGE is the dump of mpiexec (below).
>
> The first four lines show that starter.sh is called from mpiexec,
> having trouble with the (...) command...
>
> The last four lines show that mpiexec knows the machines it is
> supposed to run on...
>
> Thanx
>
>
>
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found

You are right. So the question remains: why is Open MPI building such
a line at all?

As you found the place in the source, this line is built only for
certain shells, and I would assume only in the case of an rsh/ssh
startup. If you put a `sleep 60` in your starter script, it will 1)
of course delay the start of the program, but 2) once it reaches
mpiexec, you should see some "qrsh -inherit ..." processes on the
master node of the parallel job. Are these present?
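A minimal sketch of one possible workaround for the starter method: detect the leading "(" of the sub-shell line and hand the whole string to a shell instead of exec'ing it directly. The `case` test on the first argument is an assumption about how the command arrives, not something Open MPI or SGE guarantees:

```shell
#!/bin/sh
# Sketch of a starter_method that copes with Open MPI's rsh-launcher
# sub-shell line "( test ! -r ./.profile || . ./.profile; ... orted ... )".
# exec cannot run "(" as a program, so when the first argument starts
# with "(", re-join the arguments and let a shell evaluate them, as
# ssh/rsh would do.
case "$1" in
  \(*) exec /bin/sh -c "$*" ;;  # Open MPI sub-shell command line
  *)   exec "$@" ;;             # plain jobscript or qrsh command
esac
```

With this, `exec "$@"` still handles the normal jobscript case, while the parenthesised line is evaluated by a shell.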

-- Reuti

> --------------------------------------------------------------------------
> A daemon (pid 30616) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> blade57.fit.vutbr.cz - daemon did not report back when launched
> blade39.fit.vutbr.cz - daemon did not report back when launched
> blade41.fit.vutbr.cz - daemon did not report back when launched
> blade61.fit.vutbr.cz - daemon did not report back when launched
>
>
>
>
>>
>> -- Reuti
>>
>>
>>>
>>> Thanx
>>> Ondrej
>>>
>>>
>>> Reuti wrote:
>>>> On 30.11.2009 at 18:46, Ondrej Glembek wrote:
>>>>> Hi, thanx for reply...
>>>>>
>>>>> I tried to dump the $@ before calling the exec and here it is:
>>>>>
>>>>>
>>>>> ( test ! -r ./.profile || . ./.profile;
>>>>> PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export PATH ;
>>>>> LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>> /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>>> -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca
>>>>> orte_ess_num_procs 2 --hnp-uri "3870359552.0;tcp://147.229.8.134:53727"
>>>>> --mca pls_gridengine_verbose 1 --output-filename mpi.log )
>>>>>
>>>>>
>>>>> It looks like the line gets constructed in
>>>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>>>
>>>>> Still I wonder why mpiexec calls starter.sh... I thought the
>>>>> starter was supposed to call the script which wraps a call to
>>>>> mpiexec...
>>>> Correct. This will happen for the master node of this job, i.e.
>>>> where the jobscript is executed. But it will also be used for the
>>>> qrsh -inherit calls. I wonder about one thing: I see only a call to
>>>> "orted" and not the above sub-shell on my machines. Did you compile
>>>> Open MPI with --with-sge?
>>>> The original call above would be "ssh node_xy ( test ! ....)",
>>>> which seems to work for ssh and rsh.
>>>> Just one note: with the starter script you will lose the PATH and
>>>> LD_LIBRARY_PATH set above, as a new shell is created. It might be
>>>> necessary to set them again in your starter method.
>>>> -- Reuti
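The note above about losing PATH and LD_LIBRARY_PATH could be addressed in the starter itself. A minimal sketch, re-using the install prefix that appears in the dump earlier in this thread (the prefix is site-specific and would need adjusting):

```shell
#!/bin/sh
# Sketch: re-export the Open MPI paths inside the starter_method,
# since the starter runs in a fresh shell. The prefix below is the
# one from the dump in this thread; adjust for your installation.
PREFIX=/homes/kazi/glembek/share/openmpi-1.3.3-64
PATH=$PREFIX/bin:$PATH; export PATH
LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
exec "$@"
```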
>>>>>
>>>>> Am I not right???
>>>>> Ondrej
>>>>>
>>>>>
>>>>> Reuti wrote:
>>>>>> Hi,
>>>>>> On 30.11.2009 at 16:33, Ondrej Glembek wrote:
>>>>>>> we are using a custom starter method in our SGE to launch our
>>>>>>> jobs... It looks something like this:
>>>>>>>
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> # ... we do whole bunch of stuff here
>>>>>>>
>>>>>>> # start the job in this shell
>>>>>>> exec "$@"
>>>>>> the "$@" should be replaced by the path to the jobscript (qsub)
>>>>>> or command (qrsh) plus the given options.
>>>>>> For the tasks spread to other nodes I get as argument: "orted
>>>>>> -mca ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>>>> So I wonder where the . ./.profile is coming from. Can you put a
>>>>>> `sleep 60` or alike before the `exec ...` and grep the built line
>>>>>> from `ps -e f` before it crashes?
>>>>>> -- Reuti
>>>>>>> The trouble is that mpiexec passes a command which looks like
>>>>>>> this:
>>>>>>>
>>>>>>> ( . ./.profile ..... )
>>>>>>>
>>>>>>> which, however, is not a valid exec argument...
>>>>>>>
>>>>>>> Is there any way to tell mpiexec to run it in a separate
>>>>>>> script??? Any idea how to solve this???
>>>>>>>
>>>>>>> Thanx
>>>>>>> Ondrej Glembek
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Ondrej Glembek, PhD student   E-mail: glembek_at_[hidden]
>>>>>>> UPGM FIT VUT Brno, L226       Web:    http://www.fit.vutbr.cz/~glembek
>>>>>>> Bozetechova 2, 612 66         Phone:  +420 54114-1292
>>>>>>> Brno, Czech Republic          Fax:    +420 54114-1290
>>>>>>>
>>>>>>> ICQ: 93233896
>>>>>>> GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>