Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] sge tight intregration leads to bad allocation
From: Reuti (reuti_at_[hidden])
Date: 2012-04-03 11:30:03


On 03.04.2012 at 17:24, Eloi Gaudry wrote:

> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Reuti
> Sent: Tuesday, April 3, 2012 17:13
> To: Open MPI Users
> Subject: Re: [OMPI users] sge tight intregration leads to bad allocation
>
> On 03.04.2012 at 16:59, Eloi Gaudry wrote:
>
>> Hi Reuti,
>>
>> I configured OpenMPI to support SGE tight integration and used the PE defined below for submitting the job:
>>
>> [16:36][eg_at_moe:~]$ qconf -sp fill_up
>> pe_name fill_up
>> slots 80
>> user_lists NONE
>> xuser_lists NONE
>> start_proc_args /bin/true
>> stop_proc_args /bin/true
>> allocation_rule $fill_up
>
> It should fill a host completely before moving to the next one with this definition.
> [eg: ] Yes, and it should also make sure that all hard requirements are met. Note that the allocation done by SGE is correct here; it is what OpenMPI finally does at startup that is different (and incorrect).
>
>
>> control_slaves TRUE
>> job_is_first_task FALSE
>> urgency_slots min
>> accounting_summary FALSE
>>
>> Here is the allocation info retrieved from `qstat -g t` for the related job:
>
> For me the output of `qstat -g t` shows MASTER and SLAVE entries but no variables. Is there any wrapper defined for `qstat` to reformat the output (or a ~/.sge_qstat defined)?
>
> [eg: ] Sorry, I forgot about sge_qstat being defined. As I don't have any slots available right now, I cannot relaunch the job to get updated output.
>
> And why is "num_proc=0" output everywhere? Was it redefined? (Usually it is a load sensor value set to the number of cores found in the machine, and it shouldn't be touched by hand to make it a consumable complex.)
>
> [eg: ] My mistake, I think: this was made a consumable complex so that we could easily schedule multithreaded and parallel jobs on the cluster. I guess I should define another complex (proc_available), make it consumable, and consume from that complex instead of touching the num_proc sensor...

No. A threaded job is also a parallel job, just one submitted to a PE with allocation_rule $pe_slots; no custom complex is necessary. Often such a PE is called "smp".
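
As a rough illustration of how the two allocation rules differ, here is a minimal Python sketch. It is a simplified model only, not the SGE scheduler: the function names `fill_up` and `pe_slots` and the free-slot counts are assumptions made for this example, and real scheduling also considers queues, load and resource requests.

```python
from typing import Dict, List, Optional, Tuple

Allocation = List[Tuple[str, int]]


def fill_up(free_slots: Dict[str, int], requested: int) -> Optional[Allocation]:
    """Simplified $fill_up: take every free slot on a host before moving on."""
    allocation: Allocation = []
    remaining = requested
    for host, free in free_slots.items():
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            allocation.append((host, take))
            remaining -= take
    # The request can only be granted if all slots were found.
    return allocation if remaining == 0 else None


def pe_slots(free_slots: Dict[str, int], requested: int) -> Optional[Allocation]:
    """Simplified $pe_slots: all requested slots must come from a single host."""
    for host, free in free_slots.items():
        if free >= requested:
            return [(host, requested)]
    return None


# Hypothetical free-slot counts, chosen to match the slot counts in this thread.
free = {"charlie": 2, "barney": 1, "carl": 1}
print(fill_up(free, 4))   # spreads over three hosts
print(pe_slots(free, 4))  # None: no single host has 4 free slots
print(pe_slots(free, 2))  # fits entirely on charlie
```

Under this model a 4-slot threaded job would simply not be scheduled on these hosts with $pe_slots, whereas $fill_up splits the same request across machines, which only makes sense for a distributed (MPI) job.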

So, for now we can't solve the initial issue.

-- Reuti
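
To compare what SGE granted with what Open MPI actually mapped, a small check along these lines may help. It is a sketch under stated assumptions: the helper names are made up for this example, each pe_hostfile line is assumed to have the usual "host slots queue processor-range" layout, host names are compared by short name (so "barney" and "barney.fft" are treated as the same machine), and the sample hostfile is reconstructed from the slot counts quoted in the original post rather than copied from a real file.

```python
from typing import Dict, List


def short_name(host: str) -> str:
    """Strip the domain so that 'barney' and 'barney.fft' compare equal."""
    return host.split(".", 1)[0]


def parse_pe_hostfile(text: str) -> Dict[str, int]:
    """Return {short host name: granted slots} from pe_hostfile contents."""
    granted: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host = short_name(fields[0])
        granted[host] = granted.get(host, 0) + int(fields[1])
    return granted


def find_mismatches(granted: Dict[str, int], in_use: Dict[str, int]) -> List[str]:
    """List hosts where the mapped process count differs from the granted slots."""
    used: Dict[str, int] = {}
    for host, count in in_use.items():
        key = short_name(host)
        used[key] = used.get(key, 0) + count
    report = []
    for host in sorted(set(granted) | set(used)):
        g, u = granted.get(host, 0), used.get(host, 0)
        if u > g:
            report.append(f"{host}: granted {g}, in use {u} (oversubscribed)")
        elif u < g:
            report.append(f"{host}: granted {g}, in use {u} (underused)")
    return report


# Illustrative input: slot counts from the original post; the queue column
# and the UNDEFINED processor range are placeholders, not real file contents.
SAMPLE_HOSTFILE = """\
barney.fft 1 smp4.q@barney.fft UNDEFINED
carl.fft 1 smp4.q@carl.fft UNDEFINED
charlie.fft 2 smp8.q@charlie.fft UNDEFINED
"""
# "Slots in use" per node as printed in the map of the original post.
IN_USE = {"barney": 2, "carl.fft": 1, "charlie.fft": 1}

for line in find_mismatches(parse_pe_hostfile(SAMPLE_HOSTFILE), IN_USE):
    print(line)
```

With the numbers from the original post this flags barney as oversubscribed and charlie as underused, which is the one-slot difference described there.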

>
> -- Reuti
>
>
>> ---------------------------------------------------------------------------------
>> smp4.q_at_barney.fft BIP 0/1/4 0.70 lx-amd64
>> hc:num_proc=0
>> hl:mem_free=31.215G
>> hl:mem_used=280.996M
>> hc:mem_available=1.715G
>> 1296 0.54786 semi_direc jj r 04/03/2012 16:43:49 1
>> ---------------------------------------------------------------------------------
>> smp4.q_at_carl.fft BIP 0/1/4 0.69 lx-amd64
>> hc:num_proc=0
>> hl:mem_free=30.764G
>> hl:mem_used=742.805M
>> hc:mem_available=1.715G
>> 1296 0.54786 semi_direc jj r 04/03/2012 16:43:49 1
>> ---------------------------------------------------------------------------------
>> smp8.q_at_charlie.fft BIP 0/2/8 0.57 lx-amd64
>> hc:num_proc=0
>> hl:mem_free=62.234G
>> hl:mem_used=836.797M
>> hc:mem_available=4.018G
>> 1296 0.54786 semi_direc jj r 04/03/2012 16:43:49 2
>> ---------------------------------------------------------------------------------
>>
>> SGE reports the same thing as pls_gridengine_report, i.e. what was reserved.
>> But here is the output of the current job (after it was started by OpenMPI):
>> [charlie:05294] ras:gridengine: JOB_ID: 1296
>> [charlie:05294] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1296.1/pe_hostfile
>> [charlie:05294] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=2
>> [charlie:05294] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=1
>> [charlie:05294] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=1
>>
>> ====================== ALLOCATED NODES ======================
>>
>> Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>> Daemon: [[54347,0],0] Daemon launched: True
>> Num slots: 2 Slots in use: 0
>> Num slots allocated: 2 Max slots: 0
>> Username on node: NULL
>> Num procs: 0 Next node_rank: 0
>> Data for node: Name: barney.fft Launch id: -1 Arch: 0 State: 2
>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>> Daemon: Not defined Daemon launched: False
>> Num slots: 1 Slots in use: 0
>> Num slots allocated: 1 Max slots: 0
>> Username on node: NULL
>> Num procs: 0 Next node_rank: 0
>> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>> Daemon: Not defined Daemon launched: False
>> Num slots: 1 Slots in use: 0
>> Num slots allocated: 1 Max slots: 0
>> Username on node: NULL
>> Num procs: 0 Next node_rank: 0
>>
>> =================================================================
>>
>> Map generated by mapping policy: 0200
>> Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
>> Num new daemons: 2 New daemon starting vpid 1
>> Num nodes: 3
>>
>> Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>> Daemon: [[54347,0],0] Daemon launched: True
>> Num slots: 2 Slots in use: 2
>> Num slots allocated: 2 Max slots: 0
>> Username on node: NULL
>> Num procs: 2 Next node_rank: 2
>> Data for proc: [[54347,1],0]
>> Pid: 0 Local rank: 0 Node rank: 0
>> State: 0 App_context: 0 Slot list: NULL
>> Data for proc: [[54347,1],3]
>> Pid: 0 Local rank: 1 Node rank: 1
>> State: 0 App_context: 0 Slot list: NULL
>>
>> Data for node: Name: barney.fft Launch id: -1 Arch: 0 State: 2
>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>> Daemon: [[54347,0],1] Daemon launched: False
>> Num slots: 1 Slots in use: 1
>> Num slots allocated: 1 Max slots: 0
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[54347,1],1]
>> Pid: 0 Local rank: 0 Node rank: 0
>> State: 0 App_context: 0 Slot list: NULL
>>
>> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>> Daemon: [[54347,0],2] Daemon launched: False
>> Num slots: 1 Slots in use: 1
>> Num slots allocated: 1 Max slots: 0
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[54347,1],2]
>> Pid: 0 Local rank: 0 Node rank: 0
>> State: 0 App_context: 0 Slot list: NULL
>>
>> Regards,
>> Eloi
>>
>>
>>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Reuti
>> Sent: Tuesday, April 3, 2012 16:24
>> To: Open MPI Users
>> Subject: Re: [OMPI users] sge tight intregration leads to bad allocation
>>
>> Hi,
>>
>> On 03.04.2012 at 16:12, Eloi Gaudry wrote:
>>
>>> Thanks for your feedback.
>>> No, it is the other way around: the "reserved" slots on all nodes are OK, but the "used" slots are different.
>>>
>>> Basically, I'm using SGE to schedule and book resources for a distributed job. When the job is finally launched, it uses a different allocation than the one that was reported by pls_gridengine_info.
>>>
>>> pls_grid_engine_info report states that 3 nodes were booked: barney (1 slot), carl (1 slot) and charlie (2 slots). This booking was done by sge depending on the memory requirements of the job (among others).
>>>
>>> When orterun starts the job (i.e. when SGE finally starts the scheduled job), it uses 3 nodes, but the first one (barney: 2 slots instead of 1) is oversubscribed and the last one (charlie: 1 slot instead of 2) is underused.
>>
>> Did you configure Open MPI to support SGE tight integration and use a PE for submitting the job? Can you please post the definition of the PE?
>>
>> What was the allocation you saw in SGE's `qstat -g t` for the job?
>>
>> -- Reuti
>>
>>
>>> If you need further information, please let me know.
>>>
>>> Eloi
>>>
>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
>>> Sent: Tuesday, April 3, 2012 15:58
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] sge tight intregration leads to bad allocation
>>>
>>> I'm afraid there isn't enough info here to help. Are you saying you only allocated one slot/node, so the two slots on charlie are in error?
>>>
>>> Sent from my iPad
>>>
>>> On Apr 3, 2012, at 6:23 AM, "Eloi Gaudry" <eloi.gaudry_at_[hidden]> wrote:
>>>
>>> Hi,
>>>
>>> I've observed strange behavior during rank allocation on a distributed run scheduled and submitted using SGE (Son of Grid Engine 8.0.0d) and OpenMPI-1.4.4.
>>> Briefly, there is a one-slot difference between the ranks/slots allocated by SGE and those used by OpenMPI. The issue here is that one node becomes oversubscribed at runtime.
>>>
>>> Here is the output of the allocation done for gridengine:
>>>
>>> ====================== ALLOCATED NODES ======================
>>>
>>> Data for node: Name: barney Launch id: -1 Arch: ffc91200 State: 2
>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
>>> Daemon: [[22904,0],0] Daemon launched: True
>>> Num slots: 1 Slots in use: 0
>>> Num slots allocated: 1 Max slots: 0
>>> Username on node: NULL
>>> Num procs: 0 Next node_rank: 0
>>> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
>>> Daemon: Not defined Daemon launched: False
>>> Num slots: 1 Slots in use: 0
>>> Num slots allocated: 1 Max slots: 0
>>> Username on node: NULL
>>> Num procs: 0 Next node_rank: 0
>>> Data for node: Name: charlie.fft Launch id: -1 Arch: 0 State: 2
>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
>>> Daemon: Not defined Daemon launched: False
>>> Num slots: 2 Slots in use: 0
>>> Num slots allocated: 2 Max slots: 0
>>> Username on node: NULL
>>> Num procs: 0 Next node_rank: 0
>>>
>>>
>>> And here is the allocation finally used:
>>> =================================================================
>>>
>>> Map generated by mapping policy: 0200
>>> Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
>>> Num new daemons: 2 New daemon starting vpid 1
>>> Num nodes: 3
>>>
>>> Data for node: Name: barney Launch id: -1 Arch: ffc91200 State: 2
>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
>>> Daemon: [[22904,0],0] Daemon launched: True
>>> Num slots: 1 Slots in use: 2
>>> Num slots allocated: 1 Max slots: 0
>>> Username on node: NULL
>>> Num procs: 2 Next node_rank: 2
>>> Data for proc: [[22904,1],0]
>>> Pid: 0 Local rank: 0 Node rank: 0
>>> State: 0 App_context: 0 Slot list: NULL
>>> Data for proc: [[22904,1],3]
>>> Pid: 0 Local rank: 1 Node rank: 1
>>> State: 0 App_context: 0 Slot list: NULL
>>>
>>> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
>>> Daemon: [[22904,0],1] Daemon launched: False
>>> Num slots: 1 Slots in use: 1
>>> Num slots allocated: 1 Max slots: 0
>>> Username on node: NULL
>>> Num procs: 1 Next node_rank: 1
>>> Data for proc: [[22904,1],1]
>>> Pid: 0 Local rank: 0 Node rank: 0
>>> State: 0 App_context: 0 Slot list: NULL
>>>
>>> Data for node: Name: charlie.fft Launch id: -1 Arch: 0 State: 2
>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
>>> Daemon: [[22904,0],2] Daemon launched: False
>>> Num slots: 2 Slots in use: 1
>>> Num slots allocated: 2 Max slots: 0
>>> Username on node: NULL
>>> Num procs: 1 Next node_rank: 1
>>> Data for proc: [[22904,1],2]
>>> Pid: 0 Local rank: 0 Node rank: 0
>>> State: 0 App_context: 0 Slot list: NULL
>>>
>>> Has anyone already encountered the same behavior?
>>> Is there a simpler fix than not using the tight integration mode between SGE and OpenMPI?
>>>
>>> Eloi
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users