> - Can you please post while it's running the relevant lines from:
> ps -e f --cols=500
> (f w/o -) from both machines.
> It's allocated between the nodes more like in a round-robin fashion.
[eg: ] I'll try to do this tomorrow, as soon as some slots become free. Thanks for your feedback, Reuti, I appreciate it.
Hi Reuti, here is the information from another run that is failing in the same way:
qstat -g t:
------------
---------------------------------------------------------------------------------
smp4.q@barney.fft BIP 0/3/4 3.37 lx-amd64
hc:mem_available=1.715G
hc:proc_available=1
1416 0.60500 semi_green jj r 04/06/2012 11:57:34 SLAVE
SLAVE
SLAVE
---------------------------------------------------------------------------------
smp4.q@carl.fft BIP 0/3/4 3.44 lx-amd64
hc:mem_available=1.715G
hc:proc_available=1
1416 0.60500 semi_green jj r 04/06/2012 11:57:34 SLAVE
SLAVE
SLAVE
---------------------------------------------------------------------------------
smp8.q@charlie.fft BIP 0/6/8 3.46 lx-amd64
hc:mem_available=4.018G
hc:proc_available=2
1416 0.60500 semi_green jj r 04/06/2012 11:57:34 MASTER
SLAVE
SLAVE
SLAVE
SLAVE
SLAVE
SLAVE
barney: ps -e f --cols=500:
-----------------------------------
2048 ? Sl 3:33 /opt/sge/bin/lx-amd64/sge_execd
27502 ? Sl 0:00 \_ sge_shepherd-1416 -bg
27503 ? Ss 0:00 \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
27510 ? S 0:00 \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27511 ? S 0:00 \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27512 ? Rl 12:54 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
27513 ? Rl 12:54 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
carl: ps -e f --cols=500:
-------------------------------
1928 ? Sl 3:10 /opt/sge/bin/lx-amd64/sge_execd
29022 ? Sl 0:00 \_ sge_shepherd-1416 -bg
29023 ? Ss 0:00 \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/carl/active_jobs/1416.1/1.carl
29030 ? S 0:00 \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29031 ? S 0:00 \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29032 ? Rl 13:49 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29033 ? Rl 13:50 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29034 ? Rl 13:49 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29035 ? Rl 13:49 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
charlie: ps -e f --cols=500:
-----------------------------------
1591 ? Sl 3:13 /opt/sge/bin/lx-amd64/sge_execd
8793 ? S 0:00 \_ sge_shepherd-1416 -bg
8795 ? Ss 0:00 \_ -bash /opt/sge/default/spool/charlie/job_scripts/1416
8800 ? S 0:00 \_ /opt/openmpi-1.4.4/bin/orterun --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1 --bynode -report-bindings -display-map -display-devel-map -display-allocation -display-devel-allocation -np 12 -x ACTRAN_LICENSE -x ACTRAN_PRODUCTLINE -x LD_LIBRARY_PATH -x PATH -x ACTRAN_DEBUG /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parall
8801 ? Sl 0:00 \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V barney.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose
8802 ? Sl 0:00 \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V carl.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
8807 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
8808 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
8809 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
8810 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
orterun information:
--------------------------
[charlie:08800] ras:gridengine: JOB_ID: 1416
[charlie:08800] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
[charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
[charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
[charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3
====================== ALLOCATED NODES ======================
Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
Daemon: [[57989,0],0] Daemon launched: True
Num slots: 6 Slots in use: 0
Num slots allocated: 6 Max slots: 0
Username on node: NULL
Num procs: 0 Next node_rank: 0
Data for node: Name: barney.fft Launch id: -1 Arch: 0 State: 2
Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
Daemon: Not defined Daemon launched: False
Num slots: 3 Slots in use: 0
Num slots allocated: 3 Max slots: 0
Username on node: NULL
Num procs: 0 Next node_rank: 0
Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
Daemon: Not defined Daemon launched: False
Num slots: 3 Slots in use: 0
Num slots allocated: 3 Max slots: 0
Username on node: NULL
Num procs: 0 Next node_rank: 0
=================================================================
Map generated by mapping policy: 0200
Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
Num new daemons: 2 New daemon starting vpid 1
Num nodes: 3
Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
Daemon: [[57989,0],0] Daemon launched: True
Num slots: 6 Slots in use: 4
Num slots allocated: 6 Max slots: 0
Username on node: NULL
Num procs: 4 Next node_rank: 4
Data for proc: [[57989,1],0]
Pid: 0 Local rank: 0 Node rank: 0
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],3]
Pid: 0 Local rank: 1 Node rank: 1
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],6]
Pid: 0 Local rank: 2 Node rank: 2
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],9]
Pid: 0 Local rank: 3 Node rank: 3
State: 0 App_context: 0 Slot list: NULL
Data for node: Name: barney.fft Launch id: -1 Arch: 0 State: 2
Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
Daemon: [[57989,0],1] Daemon launched: False
Num slots: 3 Slots in use: 4
Num slots allocated: 3 Max slots: 0
Username on node: NULL
Num procs: 4 Next node_rank: 4
Data for proc: [[57989,1],1]
Pid: 0 Local rank: 0 Node rank: 0
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],4]
Pid: 0 Local rank: 1 Node rank: 1
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],7]
Pid: 0 Local rank: 2 Node rank: 2
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],10]
Pid: 0 Local rank: 3 Node rank: 3
State: 0 App_context: 0 Slot list: NULL
Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
Daemon: [[57989,0],2] Daemon launched: False
Num slots: 3 Slots in use: 4
Num slots allocated: 3 Max slots: 0
Username on node: NULL
Num procs: 4 Next node_rank: 4
Data for proc: [[57989,1],2]
Pid: 0 Local rank: 0 Node rank: 0
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],5]
Pid: 0 Local rank: 1 Node rank: 1
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],8]
Pid: 0 Local rank: 2 Node rank: 2
State: 0 App_context: 0 Slot list: NULL
Data for proc: [[57989,1],11]
Pid: 0 Local rank: 3 Node rank: 3
State: 0 App_context: 0 Slot list: NULL
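
To make the map above easier to read: the rank placement it shows (0, 3, 6, 9 on charlie; 1, 4, 7, 10 on barney.fft; 2, 5, 8, 11 on carl.fft) is what a plain round-robin over the three nodes would give for -np 12. The following is only a minimal Python sketch of that idea, assuming a naive round-robin that ignores the slot counts. It is not Open MPI's actual mapping code, and the function name map_bynode is made up for illustration. The slot counts are the ones reported from the PE_HOSTFILE above.

```python
def map_bynode(slots, nprocs):
    """Hypothetical helper: hand out ranks 0..nprocs-1 one node at a time,
    cycling over the nodes in order and ignoring how many slots each has."""
    nodes = list(slots)
    placement = {node: [] for node in nodes}
    for rank in range(nprocs):
        placement[nodes[rank % len(nodes)]].append(rank)
    return placement


# Slot counts as reported by ras:gridengine from the PE_HOSTFILE.
slots = {"charlie": 6, "barney.fft": 3, "carl.fft": 3}
placement = map_bynode(slots, 12)

for node, ranks in placement.items():
    extra = len(ranks) - slots[node]
    if extra > 0:
        status = "oversubscribed by %d" % extra
    else:
        status = "within granted slots"
    print(node, ranks, status)
```

With these inputs every node gets 4 ranks, so barney.fft and carl.fft each end up one process over their 3 granted slots while charlie uses only 4 of its 6, which is the "Slots in use: 4" versus "Num slots allocated: 3" pattern in the map.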