Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] sge tight integration leads to bad allocation
From: Reuti (reuti_at_[hidden])
Date: 2012-04-05 12:41:18


On 05.04.2012 at 17:55, Eloi Gaudry wrote:

>
> >> Here is the allocation info retrieved from `qstat -g t` for the related job:
> >
> > For me the output of `qstat -g t` shows MASTER and SLAVE entries but no variables. Is there any wrapper defined for `qstat` to reformat the output (or a ~/.sge_qstat defined)?
> >
> > [eg: ] Sorry, I forgot about sge_qstat being defined. As I don't have any slots available right now, I cannot relaunch the job to get updated output.
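
No problem. For a future run, a quick way to get the plain `qstat -g t` output (just a sketch, assuming the reformatting comes only from ~/.sge_qstat and not from a shell alias or a wrapper script) would be to move the defaults file aside temporarily:

type qstat                         # reveals an alias or wrapper script, if any
mv ~/.sge_qstat ~/.sge_qstat.off   # disable the personal qstat defaults
qstat -g t
mv ~/.sge_qstat.off ~/.sge_qstat   # restore them afterwards
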
> Reuti, here is the output you asked for two days ago.
> It was produced with another "bad" run, for which 3 processes are running on each of the nodes charlie and carl, whereas we should have only 2 processes on carl and 4 on charlie.

This is indeed strange: Open MPI first detects the correct allocation, and it conforms to the one granted by SGE.

- Did you use a plain `mpiexec`, without any process count (-np) or machinefile?
- While the job is running, can you please post the relevant lines from:

ps -e f --cols=500

(note that the `f` has no leading `-`) from both machines?
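
The interesting part is the process trees below sge_execd on each node: with a tight integration one typically sees sge_shepherd -> job script -> orterun and the local ranks on the master node, and sge_shepherd -> qrsh_starter -> orted and the remote ranks on the slave node. As a rough filter (only a sketch; "your_binary" stands for the actual name of your MPI application):

ps -e f --cols=500 | egrep 'sge_execd|sge_shepherd|qrsh_starter|orterun|orted|your_binary'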

Nevertheless, the processes end up distributed across the nodes more in a round-robin fashion.
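
Just a thought, and only a sketch assuming an Open MPI 1.4/1.5-style installation (with <prefix> standing for your Open MPI installation directory): you could also check whether a by-node/round-robin mapping policy is being forced via MCA parameters, e.g.:

grep -i rmaps <prefix>/etc/openmpi-mca-params.conf ~/.openmpi/mca-params.conf 2>/dev/null
env | grep OMPI_MCA_rmaps
ompi_info --param rmaps all

The parameter to look out for would be something like rmaps_base_schedule_policy being set to "node" (the exact name may differ between Open MPI versions).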

-- Reuti

>
> Output from qstat -g t:
> ------------------------------------
> queuename qtype resv/used/tot. load_avg arch states
> ---------------------------------------------------------------------------------
> smp4.q_at_carl.fft BIP 0/2/4 1.14 lx-amd64
> hc:mem_available=1.715G
> 1391 0.57643 semi_green jj r 04/05/2012 15:41:04 SLAVE
> SLAVE
> ---------------------------------------------------------------------------------
> smp8.q_at_charlie.fft BIP 0/4/8 1.73 lx-amd64
> hc:mem_available=4.018G
> 1391 0.57643 semi_green jj r 04/05/2012 15:41:04 MASTER
> SLAVE
> SLAVE
> SLAVE
> SLAVE
>
> Debug output from orterun:
> ------------------------------------
> [charlie:08194] ras:gridengine: JOB_ID: 1391
> [charlie:08194] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1391.1/pe_hostfile
> [charlie:08194] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=4
> [charlie:08194] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=2
>
> ====================== ALLOCATED NODES ======================
>
> Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: [[57575,0],0] Daemon launched: True
> Num slots: 4 Slots in use: 0
> Num slots allocated: 4 Max slots: 0
> Username on node: NULL
> Num procs: 0 Next node_rank: 0
> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: Not defined Daemon launched: False
> Num slots: 2 Slots in use: 0
> Num slots allocated: 2 Max slots: 0
> Username on node: NULL
> Num procs: 0 Next node_rank: 0
>
> =================================================================
>
> Map generated by mapping policy: 0200
> Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
> Num new daemons: 1 New daemon starting vpid 1
> Num nodes: 2
>
> Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: [[57575,0],0] Daemon launched: True
> Num slots: 4 Slots in use: 3
> Num slots allocated: 4 Max slots: 0
> Username on node: NULL
> Num procs: 3 Next node_rank: 3
> Data for proc: [[57575,1],0]
> Pid: 0 Local rank: 0 Node rank: 0
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57575,1],2]
> Pid: 0 Local rank: 1 Node rank: 1
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57575,1],4]
> Pid: 0 Local rank: 2 Node rank: 2
> State: 0 App_context: 0 Slot list: NULL
>
> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: [[57575,0],1] Daemon launched: False
> Num slots: 2 Slots in use: 3
> Num slots allocated: 2 Max slots: 0
> Username on node: NULL
> Num procs: 3 Next node_rank: 3
> Data for proc: [[57575,1],1]
> Pid: 0 Local rank: 0 Node rank: 0
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57575,1],3]
> Pid: 0 Local rank: 1 Node rank: 1
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57575,1],5]
> Pid: 0 Local rank: 2 Node rank: 2
> State: 0 App_context: 0 Slot list: NULL
>
>
>
> Regards,
> Eloi
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users