Hi Ralph,

Thanks for your feedback.
No, it is the other way around: the "reserved" slots on all nodes are OK, but the "used" slots are different.

Basically, I'm using SGE to schedule and book resources for a distributed job. When the job is finally launched, it uses a different allocation than the one reported by pls_gridengine_info.

The pls_gridengine_info report states that 3 nodes were booked: barney (1 slot), carl (1 slot) and charlie (2 slots). This booking was done by SGE depending on the memory requirements of the job (among others).

When orterun starts the job (i.e. when SGE finally starts the scheduled job), it uses the same 3 nodes, but the first one is oversubscribed (barney: 2 slots used instead of 1) and the last one is underused (charlie: 1 slot used instead of 2).

If you need further information, please let me know.

Eloi

From: users-bounces_at_[hidden] On Behalf Of Ralph Castain
Sent: Tuesday, April 3, 2012 15:58
To: Open MPI Users
Subject: Re: [OMPI users] sge tight intregration leads to bad allocation

I'm afraid there isn't enough info here to help. Are you saying you only allocated one slot per node, so the two slots on charlie are in error?

On Apr 3, 2012, at 6:23 AM, "Eloi Gaudry" <eloi.gaudry_at_[hidden]> wrote:

Hi,

I've observed strange behavior during rank allocation on a distributed run scheduled and submitted using SGE (Son of Grid Engine 8.0.0d) and OpenMPI-1.4.4.
Briefly, there is a one-slot difference between the slots allocated by SGE and the slots actually used by OpenMPI. As a result, one node becomes oversubscribed at runtime.

Here is the allocation output for gridengine:

======================  ALLOCATED NODES  ======================

Data for node: Name: barney               Launch id: -1     Arch: ffc91200  State: 2
              Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
              Daemon: [[22904,0],0] Daemon launched: True
              Num slots: 1     Slots in use: 0
              Num slots allocated: 1  Max slots: 0
              Username on node: NULL
              Num procs: 0    Next node_rank: 0
Data for node: Name: carl.fft                Launch id: -1     Arch: 0 State: 2
              Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
              Daemon: Not defined  Daemon launched: False
              Num slots: 1     Slots in use: 0
              Num slots allocated: 1  Max slots: 0
              Username on node: NULL
              Num procs: 0    Next node_rank: 0
Data for node: Name: charlie.fft                          Launch id: -1     Arch: 0 State: 2
              Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
              Daemon: Not defined  Daemon launched: False
              Num slots: 2     Slots in use: 0
              Num slots allocated: 2  Max slots: 0
              Username on node: NULL
              Num procs: 0    Next node_rank: 0

And here is the allocation finally used:
=================================================================

Map generated by mapping policy: 0200
              Npernode: 0     Oversubscribe allowed: TRUE  CPU Lists: FALSE
              Num new daemons: 2 New daemon starting vpid 1
              Num nodes: 3
Data for node: Name: barney               Launch id: -1     Arch: ffc91200  State: 2
              Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
              Daemon: [[22904,0],0] Daemon launched: True
              Num slots: 1     Slots in use: 2
              Num slots allocated: 1  Max slots: 0
              Username on node: NULL
              Num procs: 2    Next node_rank: 2
              Data for proc: [[22904,1],0]
                             Pid: 0    Local rank: 0      Node rank: 0
                             State: 0               App_context: 0               Slot list: NULL
              Data for proc: [[22904,1],3]
                             Pid: 0    Local rank: 1      Node rank: 1
                             State: 0               App_context: 0               Slot list: NULL
Data for node: Name: carl.fft                Launch id: -1     Arch: 0 State: 2
              Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
              Daemon: [[22904,0],1] Daemon launched: False
              Num slots: 1     Slots in use: 1
              Num slots allocated: 1  Max slots: 0
              Username on node: NULL
              Num procs: 1    Next node_rank: 1
              Data for proc: [[22904,1],1]
                             Pid: 0    Local rank: 0      Node rank: 0
                             State: 0               App_context: 0               Slot list: NULL
Data for node: Name: charlie.fft                          Launch id: -1     Arch: 0 State: 2
              Num boards: 1 Num sockets/board: 2 Num cores/socket: 2
              Daemon: [[22904,0],2] Daemon launched: False
              Num slots: 2     Slots in use: 1
              Num slots allocated: 2  Max slots: 0
              Username on node: NULL
              Num procs: 1    Next node_rank: 1
              Data for proc: [[22904,1],2]
                             Pid: 0    Local rank: 0      Node rank: 0
                             State: 0               App_context: 0               Slot list: NULL
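
To make the mismatch in the map above easier to spot, here is a minimal Python sketch that reads the "Num slots" and "Slots in use" fields of each node and reports where they differ. It is only an illustration: the helper names (slot_usage, mismatches) and the abridged MAP_DUMP string are assumptions made for this example, not part of Open MPI or SGE, and the regular expressions only assume the field layout shown in this message.

```python
import re

# Abridged copy of the "allocation finally used" dump from this message
# (only the lines the parser needs are kept).
MAP_DUMP = """\
Data for node: Name: barney               Launch id: -1     Arch: ffc91200  State: 2
              Num slots: 1     Slots in use: 2
              Num slots allocated: 1  Max slots: 0
Data for node: Name: carl.fft                Launch id: -1     Arch: 0 State: 2
              Num slots: 1     Slots in use: 1
              Num slots allocated: 1  Max slots: 0
Data for node: Name: charlie.fft                          Launch id: -1     Arch: 0 State: 2
              Num slots: 2     Slots in use: 1
              Num slots allocated: 2  Max slots: 0
"""

NODE_RE = re.compile(r"Data for node: Name:\s*(\S+)")
SLOTS_RE = re.compile(r"Num slots:\s*(\d+)\s+Slots in use:\s*(\d+)")


def slot_usage(dump):
    """Return {node: (num_slots, slots_in_use)} parsed from a node dump."""
    usage = {}
    node = None
    for line in dump.splitlines():
        match = NODE_RE.search(line)
        if match:
            node = match.group(1)
            continue
        match = SLOTS_RE.search(line)
        if match and node is not None:
            usage[node] = (int(match.group(1)), int(match.group(2)))
    return usage


def mismatches(usage):
    """Return {node: slots_in_use - num_slots} for nodes where the two differ."""
    return {
        node: used - slots
        for node, (slots, used) in usage.items()
        if used != slots
    }


if __name__ == "__main__":
    for node, delta in mismatches(slot_usage(MAP_DUMP)).items():
        state = "oversubscribed" if delta > 0 else "underused"
        print(f"{node}: {state} by {abs(delta)} slot(s)")
```

Run against the abridged dump, this flags barney as oversubscribed by one slot and charlie.fft as underused by one slot, while carl.fft matches its allocation.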

Has anyone already encountered the same behavior?
Is there a simpler fix than not using the tight integration mode between SGE and OpenMPI?

Eloi