Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Trouble with SGE integration
From: Reuti (reuti_at_[hidden])
Date: 2009-12-01 05:50:53


On 01.12.2009 at 10:32, Ondrej Glembek wrote:

> Just to add more info:
>
> Reuti wrote:
>> On 30.11.2009 at 20:07, Ondrej Glembek wrote:
>>
>> But I think the real problem is that Open MPI assumes you are outside
>> of SGE and so uses a different startup. Are you resetting any of SGE's
>> environment variables in your custom starter method (like $JOB_ID)?
>
> Also one of the reasons that makes me think that Open MPI knows it is
> inside of SGE is the dump of mpiexec (below).
>
> The first four lines show that starter.sh is called from mpiexec,
> having trouble with the "( ... )" command...
>
> The last four lines show that mpiexec knows the machines it is
> supposed to run on...
>
> Thanx
>
>
>
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found

You are right. So the question remains: why is Open MPI building such
a line at all?
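The failure itself is easy to reproduce outside of SGE. This is just an illustrative sketch of mine, not part of the original exchange: `exec` only accepts a real executable name, so handing it shell syntax like the leading "(" fails the same way starter.sh's line 9 does.

```shell
#!/bin/sh
# Minimal reproduction (illustrative sketch): exec can only run an actual
# program, so passing it the word "(" from the constructed sub-shell line
# produces the same "exec: (: not found" error seen in the job output.
out=$(sh -c 'exec "$@"' starter '(' 'echo' 'hi' ')' 2>&1)
echo "$out"
```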

As you found the place in the source: it's done only for certain
shells, and I would assume only in the case of an rsh/ssh startup. If
you put a `sleep 60` in your starter script, it will 1) delay the
start of the program, of course, but 2) once it gets to mpiexec you
should see some "qrsh -inherit ..." processes on the master node of
the parallel job. Are these present?
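If the sub-shell line turns out to be unavoidable, one stopgap on the starter side would be to detect it and hand it back to a shell rather than exec'ing it word by word. A hypothetical sketch (the `run_job` dispatch is my own invention, not anything from Open MPI or SGE; a real starter method would `exec` in both branches):

```shell
#!/bin/sh
# Hypothetical starter-side workaround: Open MPI's rsh launcher may hand us
# a "( ... )" sub-shell one-liner, which exec cannot run directly. Detect
# the leading "(" and let /bin/sh parse the rejoined line; otherwise run
# the jobscript or command as usual.
run_job() {
  case "$1" in
    \(*) /bin/sh -c "$*" ;;   # sub-shell line: rejoin the words, let sh parse it
    *)   "$@" ;;              # normal case: jobscript or plain command
  esac
}

# miniature of the kind of line mpiexec constructs:
run_job '(' 'echo' 'started' ')'
```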

-- Reuti

> --------------------------------------------------------------------------
> A daemon (pid 30616) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> blade57.fit.vutbr.cz - daemon did not report back when launched
> blade39.fit.vutbr.cz - daemon did not report back when launched
> blade41.fit.vutbr.cz - daemon did not report back when launched
> blade61.fit.vutbr.cz - daemon did not report back when launched
>
>
>
>
>>
>> -- Reuti
>>
>>
>>>
>>> Thanx
>>> Ondrej
>>>
>>>
>>> Reuti wrote:
>>>> On 30.11.2009 at 18:46, Ondrej Glembek wrote:
>>>>> Hi, thanx for reply...
>>>>>
>>>>> I tried to dump the $@ before calling the exec and here it is:
>>>>>
>>>>>
>>>>> ( test ! -r ./.profile || . ./.profile;
>>>>> PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export PATH ;
>>>>> LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>> /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>>> -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca
>>>>> orte_ess_num_procs 2 --hnp-uri "3870359552.0;tcp://147.229.8.134:53727"
>>>>> --mca pls_gridengine_verbose 1 --output-filename mpi.log )
>>>>>
>>>>>
>>>>> It looks like the line gets constructed in
>>>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>>>
>>>>> Still I wonder why mpiexec calls the starter.sh... I thought the
>>>>> starter was supposed to call the script which wraps a call to
>>>>> mpiexec...
>>>> Correct. This will happen for the master node of this job, i.e. where
>>>> the jobscript is executed. But it will also be used for the qrsh
>>>> -inherit calls. I wonder about one thing: I see only a call to
>>>> "orted" and not the above sub-shell on my machines. Did you compile
>>>> Open MPI with --with-sge?
>>>> The original call above would be "ssh node_xy ( test ! ....)", which
>>>> seems to work for ssh and rsh.
>>>> Just one note: with the starter script you will lose the PATH and
>>>> LD_LIBRARY_PATH you set, as a new shell is created. It might be
>>>> necessary to set them again in your starter method.
>>>> -- Reuti
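To make the note about the lost environment concrete, here is a sketch of a starter method that re-exports the paths before handing off to the job. The prefix below merely mirrors the path visible in the dump earlier in the thread and is only an example, not the actual site configuration.

```shell
#!/bin/sh
# Sketch of a starter method that restores the environment lost to the new
# shell. OMPI_PREFIX is an example default mirroring the path from the dump;
# override it for the actual installation.
OMPI_PREFIX=${OMPI_PREFIX:-/homes/kazi/glembek/share/openmpi-1.3.3-64}
PATH=$OMPI_PREFIX/bin:$PATH; export PATH
LD_LIBRARY_PATH=$OMPI_PREFIX/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH

# hand control to the jobscript or command SGE passes in
exec "$@"
```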
>>>>>
>>>>> Am I not right???
>>>>> Ondrej
>>>>>
>>>>>
>>>>> Reuti wrote:
>>>>>> Hi,
>>>>>> On 30.11.2009 at 16:33, Ondrej Glembek wrote:
>>>>>>> we are using a custom starter method in our SGE to launch our
>>>>>>> jobs... It looks something like this:
>>>>>>>
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> # ... we do a whole bunch of stuff here
>>>>>>>
>>>>>>> # start the job in this shell
>>>>>>> exec "$@"
>>>>>> the "$@" should be replaced by the path to the jobscript (qsub) or
>>>>>> command (qrsh) plus the given options.
>>>>>> For the tasks spread to other nodes I get as argument: "orted -mca
>>>>>> ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>>>> So I wonder where the . ./.profile is coming from. Can you put a
>>>>>> `sleep 60` or the like before the `exec ...` and grep the built line
>>>>>> from `ps -e f` before it crashes?
>>>>>> -- Reuti
>>>>>>> The trouble is that mpiexec passes a command which looks like this:
>>>>>>>
>>>>>>> ( . ./.profile ..... )
>>>>>>>
>>>>>>> which, however, is not a valid exec argument...
>>>>>>>
>>>>>>> Is there any way to tell mpiexec to run it in a separate script???
>>>>>>> Any idea how to solve this???
>>>>>>>
>>>>>>> Thanx
>>>>>>> Ondrej Glembek
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Ondrej Glembek, PhD student   E-mail: glembek_at_[hidden]
>>>>>>> UPGM FIT VUT Brno, L226       Web: http://www.fit.vutbr.cz/~glembek
>>>>>>> Bozetechova 2, 612 66         Phone: +420 54114-1292
>>>>>>> Brno, Czech Republic          Fax: +420 54114-1290
>>>>>>>
>>>>>>> ICQ: 93233896
>>>>>>> GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>
>>
>