On 07.07.2008 at 11:31, Romaric David wrote:
> Pak Lui wrote:
>> It was fixed at one point in the trunk before v1.3 went official,
>> but while rolling the code from gridengine PLM into the rsh PLM
>> code, this feature was left out because there were some lingering
>> issues that I didn't resolve and I lost track of it. Sorry, but
>> thanks for bringing it up, I will need to look at the issue again
>> and reopen this ticket against v1.3:
> Ok, so do I have to wait for a 1.3 release to use job suspend, or
> will it be back-ported to 1.2.6 or 1.2.6?
>> So even though it is the rsh PLM that starts the parallel job under SGE,
>> the rsh PLM can detect if the Open MPI job is started under the
>> SGE Parallel Environment (via checking some SGE env vars) and use
>> the "qrsh -inherit" command to launch the parallel job the same
>> way as it was before. You can check by passing an MCA parameter
>> such as "--mca plm_base_verbose 10" to your mpirun command and looking
>> for the launch commands that mpirun uses.
> It looks like the shepherd cannot be started, for a reason I haven't
> figured out yet.
> /opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
> reading exit code from shepherd ... 255
> [hostname:16745] ----------------------------
You mean with the plain rsh startup, like a loose integration? In that
case, isn't a proper hostlist necessary, which for other MPI
implementations is built in the routine defined by start_proc_args? AFAIK
you can disregard the hostlist only with Open MPI's tight SGE support.
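
To make the loose vs. tight distinction concrete, here is a minimal
Python sketch of the two steps involved: deciding whether a process is
running inside an SGE parallel environment, and turning $PE_HOSTFILE
into a hostlist the way a start_proc_args script typically would. The
function names are made up for illustration, and the set of variables
checked (SGE_ROOT, ARC, PE_HOSTFILE, JOB_ID) is my assumption about what
the rsh PLM looks at, not something confirmed in this thread. The
hostfile layout assumed is "host slots queue processor-range" per line.

```python
import os

# Assumption: these are the SGE variables used to detect a parallel environment.
SGE_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def under_sge_pe(env=None):
    """Return True if every SGE variable above is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in SGE_VARS)


def parse_pe_hostfile(text):
    """Parse $PE_HOSTFILE content into a list of (host, slots) pairs."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def machinefile_lines(hosts):
    """Expand (host, slots) pairs to one host name per slot."""
    return [host for host, slots in hosts for _ in range(slots)]
```

With tight integration none of the hostlist handling should be needed on
the user side; with a loose integration something like
machinefile_lines() is what the start script would have to write out for
the MPI launcher.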