
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads
From: Lengyel, Florian (flengyel_at_[hidden])
Date: 2009-07-08 18:42:05


This was addressed to the Open MPI list; on the SGE list you suggested
changing the PE allocation rule from $fill_up to $pe_slots. The PE is now:

[flengyel_at_nept OPENMPI]$ qconf -sp ompi
pe_name ompi
slots 999
user_lists Research
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
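For reference (this is standard SGE 6.x administration, not something specific to this cluster), the PE definition above can be inspected and edited with qconf; the comments summarize how the two allocation rules discussed here differ:

```shell
qconf -sp ompi   # show the current PE definition
qconf -mp ompi   # edit it in $EDITOR; the relevant line is allocation_rule:
#   allocation_rule $pe_slots  -> all requested slots must come from one host
#   allocation_rule $fill_up   -> fill one host's slots, then spill to the next
```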

but the result is the same:

[flengyel_at_nept OPENMPI]$ tail -f sum.e23310
Starting server daemon at host "m18.gc.cuny.edu"
Server daemon successfully started with task id "1.m18"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m18.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m18.gc.cuny.edu:26399] ERROR: A daemon on node m18.gc.cuny.edu failed to start as expected.
[m18.gc.cuny.edu:26399] ERROR: There may be more information available from
[m18.gc.cuny.edu:26399] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m18.gc.cuny.edu:26399] ERROR: If the problem persists, please restart the
[m18.gc.cuny.edu:26399] ERROR: Grid Engine PE job
[m18.gc.cuny.edu:26399] ERROR: The daemon exited unexpectedly with status 129.
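For what it's worth, the two numbers in that log follow the usual Unix conventions: the rsh helper itself died from signal 13 (SIGPIPE), and the exit status 129 the shepherd reports encodes, under the common 128+N convention, death by signal 1 (SIGHUP). A small sketch of that decoding (the conventions are standard; tying them to this particular failure is only my reading of the log):

```python
import signal

def describe_exit(status: int) -> str:
    """Decode a shell-style exit status: values above 128
    conventionally mean the process died from signal (status - 128)."""
    if status > 128:
        signum = status - 128
        return f"killed by signal {signum} ({signal.Signals(signum).name})"
    return f"exited normally with status {status}"

# The status the shepherd reported for the Open MPI daemon:
print(describe_exit(129))       # -> killed by signal 1 (SIGHUP)
# The signal the rsh helper itself died from:
print(signal.Signals(13).name)  # -> SIGPIPE
```

If that reading is right, the PIPE on the rsh side is a symptom (the remote end went away), and the 128+1 status from the shepherd is the more informative of the two.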

On Tue, Jul 7, 2009 at 5:05 PM, Reuti <reuti_at_[hidden]> wrote:

> Hi,
>
> On 07.07.2009, at 22:12, Lengyel, Florian wrote:
>
> Hi,
>> I may have overlooked something in the archives (not to mention
>> Googling); if so I apologize, but I have been unable to find info on
>> this particular problem.
>>
>> OpenMPI+SGE tight integration works on E6600 core duo systems but not on
>> Q9550 quads.
>> Could use some troubleshooting assistance. Thanks.
>>
>> Is this what you found out about your question?
>
> I'm not aware of this. What should be the cause of it?!? Do you have a link
> - was it on the SGE list?
>
> -- Reuti
>
>
>> I'm running SGE 6.0u10 on a Linux cluster running openSUSE 11.
>>
>> Open MPI was compiled with SGE support, and the required components are present:
>>
>> [flengyel_at_nept OPENMPI]$ ompi_info | grep gridengine
>> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>>
>>
>> The parallel execution environment for OpenMPI is as follows:
>>
>> [flengyel_at_nept OPENMPI]$ qconf -sp ompi
>> pe_name ompi
>> slots 999
>> user_lists Research
>> xuser_lists NONE
>> start_proc_args /bin/true
>> stop_proc_args /bin/true
>> allocation_rule $fill_up
>> control_slaves TRUE
>> job_is_first_task FALSE
>> urgency_slots min
>>
>> A trivial OpenMPI job using this pe will run on a queue for Intel E6600
>> core duo machines:
>>
>> [flengyel_at_nept OPENMPI]$ cat sum2.sh
>>
>> #!/bin/bash
>> #$ -S /bin/bash
>> #$ -q x86_64.q
>> #$ -N sum
>> #$ -pe ompi 4
>>
>> #$ -cwd
>>
>> export PATH=/home/nept/apps64/openmpi/bin:$PATH
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
>> . /usr/local/sge/default/common/settings.sh
>> mpirun --mca pls_gridengine_verbose 2 --prefix /home/nept/apps64/openmpi -v ./sum
>>
>> Here are the results:
>>
>> [flengyel_at_nept OPENMPI]$ qsub sum2.sh
>> Your job 23194 ("sum") has been submitted
>>
>> [flengyel_at_nept OPENMPI]$ qstat -r -u flengyel
>>
>> job-ID prior name user state submit/start at queue slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>> 23194 0.25007 sum flengyel r 07/07/2009 14:14:40 x86_64.q_at_[hidden] 4
>> Full jobname: sum
>> Master queue: x86_64.q_at_[hidden]
>> Requested PE: ompi 4
>> Granted PE: ompi 4
>> Hard Resources:
>> Soft Resources:
>> Hard requested queues: x86_64.q
>>
>>
>> [flengyel_at_nept OPENMPI]$ more sum.o23194
>>
>> The sum from 1 to 1000 is: 500500
>> [flengyel_at_nept OPENMPI]$ more sum.e23194
>> Starting server daemon at host "m49.gc.cuny.edu"
>> Starting server daemon at host "m33.gc.cuny.edu"
>> Server daemon successfully started with task id "1.m49"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
>> Server daemon successfully started with task id "1.m33"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
>> reading exit code from shepherd ...
>>
>> But the same job with the queue set to quad.q for the Q9550 quad-core
>> machines has daemon trouble:
>>
>>
>> [flengyel_at_nept OPENMPI]$ !qstat
>> qstat -r -u flengyel
>> job-ID prior name user state submit/start at queue slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>> 23196 0.25000 sum flengyel r 07/07/2009 14:26:21 quad.q_at_[hidden] 2
>> Full jobname: sum
>> Master queue: quad.q_at_[hidden]
>> Requested PE: ompi 2
>> Granted PE: ompi 2
>> Hard Resources:
>> Soft Resources:
>> Hard requested queues: quad.q
>> [flengyel_at_nept OPENMPI]$ more sum.e23196
>> Starting server daemon at host "m15.gc.cuny.edu"
>> Starting server daemon at host "m09.gc.cuny.edu"
>> Server daemon successfully started with task id "1.m15"
>> Server daemon successfully started with task id "1.m09"
>> Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
>> reading exit code from shepherd ... Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
>> /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
>> reading exit code from shepherd ... 129
>> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
>> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
>> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
>> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
>> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
>> 129
>> [m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
>> [m09.gc.cuny.edu:11413] ERROR: There may be more information available from
>> [m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
>> [m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
>> [m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
>> [flengyel_at_nept OPENMPI]$
>>
>>
>> -FL
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206057
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe_at_[hidden]].
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>