> - Can you please post while it's running the relevant lines from:
> ps -e f --cols=500
> (f w/o -) from both machines.
> It's allocated between the nodes more like in a round-robin fashion.
[eg: ] I'll try to do this tomorrow, as soon as some slots become free. Thanks for your feedback, Reuti, I appreciate it.

Hi Reuti, here is the information from another run that is failing in the same way:

qstat -g t:
------------
---------------------------------------------------------------------------------
smp4.q_at_barney.fft             BIP  0/3/4         3.37    lx-amd64
 hc:mem_available=1.715G
 hc:proc_available=1
  1416 0.60500 semi_green jj          r    04/06/2012 11:57:34 SLAVE
                                                                 SLAVE
                                                                 SLAVE
---------------------------------------------------------------------------------
smp4.q_at_carl.fft               BIP  0/3/4         3.44    lx-amd64
 hc:mem_available=1.715G
 hc:proc_available=1
  1416 0.60500 semi_green jj          r    04/06/2012 11:57:34 SLAVE
                                                                 SLAVE
                                                                 SLAVE
---------------------------------------------------------------------------------
smp8.q_at_charlie.fft            BIP  0/6/8         3.46    lx-amd64
 hc:mem_available=4.018G
 hc:proc_available=2
  1416 0.60500 semi_green jj          r    04/06/2012 11:57:34 MASTER
                                                                 SLAVE
                                                                 SLAVE
                                                                 SLAVE
                                                                 SLAVE
                                                                 SLAVE
                                                                 SLAVE

barney: ps -e f --cols=500:
-----------------------------------
 2048 ?       Sl    3:33 /opt/sge/bin/lx-amd64/sge_execd
27502 ?       Sl    0:00 \_ sge_shepherd-1416 -bg
27503 ?       Ss    0:00     \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
27510 ?       S     0:00         \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27511 ?       S     0:00             \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27512 ?       Rl   12:54                 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
27513 ?       Rl   12:54                 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat

carl: ps -e f --cols=500:
-------------------------------
 1928 ?       Sl    3:10 /opt/sge/bin/lx-amd64/sge_execd
29022 ?       Sl    0:00 \_ sge_shepherd-1416 -bg
29023 ?       Ss    0:00     \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/carl/active_jobs/1416.1/1.carl
29030 ?       S     0:00         \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29031 ?       S     0:00             \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29032 ?       Rl   13:49                 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29033 ?       Rl   13:50                 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29034 ?       Rl   13:49                 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29035 ?       Rl   13:49                 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat

charlie: ps -e f --cols=500:
-----------------------------------
 1591 ?       Sl    3:13 /opt/sge/bin/lx-amd64/sge_execd
 8793 ?       S     0:00 \_ sge_shepherd-1416 -bg
 8795 ?       Ss    0:00     \_ -bash /opt/sge/default/spool/charlie/job_scripts/1416
 8800 ?       S     0:00         \_ /opt/openmpi-1.4.4/bin/orterun --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1 --bynode -report-bindings -display-map -display-devel-map -display-allocation -display-devel-allocation -np 12 -x ACTRAN_LICENSE -x ACTRAN_PRODUCTLINE -x LD_LIBRARY_PATH -x PATH -x ACTRAN_DEBUG /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parall
 8801 ?       Sl    0:00             \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V barney.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose
 8802 ?       Sl    0:00             \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V carl.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
 8807 ?       Rl   14:23             \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8808 ?       Rl   14:23             \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8809 ?       Rl   14:23             \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8810 ?       Rl   14:23             \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat

orterun information:
--------------------------
[charlie:08800] ras:gridengine: JOB_ID: 1416
[charlie:08800] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
[charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
[charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
[charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3
======================   ALLOCATED NODES   ======================
 Data for node: Name: charlie  Launch id: -1 Arch: ffc91200 State: 2
 Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
 Daemon: [[57989,0],0] Daemon launched: True
 Num slots: 6 Slots in use: 0
 Num slots allocated: 6 Max slots: 0
 Username on node: NULL
 Num procs: 0 Next node_rank: 0
 Data for node: Name: barney.fft   Launch id: -1 Arch: 0 State: 2
 Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
 Daemon: Not defined Daemon launched: False
 Num slots: 3 Slots in use: 0
 Num slots allocated: 3 Max slots: 0
 Username on node: NULL
 Num procs: 0 Next node_rank: 0
 Data for node: Name: carl.fft   Launch id: -1 Arch: 0 State: 2
 Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
 Daemon: Not defined Daemon launched: False
 Num slots: 3 Slots in use: 0
 Num slots allocated: 3 Max slots: 0
 Username on node: NULL
 Num procs: 0 Next node_rank: 0
=================================================================
 Map generated by mapping policy: 0200
 Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
 Num new daemons: 2 New daemon starting vpid 1
 Num nodes: 3
 Data for node: Name: charlie  Launch id: -1 Arch: ffc91200 State: 2
 Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
 Daemon: [[57989,0],0] Daemon launched: True
 Num slots: 6 Slots in use: 4
 Num slots allocated: 6 Max slots: 0
 Username on node: NULL
 Num procs: 4 Next node_rank: 4
 Data for proc: [[57989,1],0]
   Pid: 0 Local rank: 0 Node rank: 0
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],3]
   Pid: 0 Local rank: 1 Node rank: 1
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],6]
   Pid: 0 Local rank: 2 Node rank: 2
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],9]
   Pid: 0 Local rank: 3 Node rank: 3
   State: 0 App_context: 0 Slot list: NULL
 Data for node: Name: barney.fft   Launch id: -1 Arch: 0 State: 2
 Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
 Daemon: [[57989,0],1] Daemon launched: False
 Num slots: 3 Slots in use: 4
 Num slots allocated: 3 Max slots: 0
 Username on node: NULL
 Num procs: 4 Next node_rank: 4
 Data for proc: [[57989,1],1]
   Pid: 0 Local rank: 0 Node rank: 0
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],4]
   Pid: 0 Local rank: 1 Node rank: 1
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],7]
   Pid: 0 Local rank: 2 Node rank: 2
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],10]
   Pid: 0 Local rank: 3 Node rank: 3
   State: 0 App_context: 0 Slot list: NULL
 Data for node: Name: carl.fft   Launch id: -1 Arch: 0 State: 2
 Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
 Daemon: [[57989,0],2] Daemon launched: False
 Num slots: 3 Slots in use: 4
 Num slots allocated: 3 Max slots: 0
 Username on node: NULL
 Num procs: 4 Next node_rank: 4
 Data for proc: [[57989,1],2]
   Pid: 0 Local rank: 0 Node rank: 0
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],5]
   Pid: 0 Local rank: 1 Node rank: 1
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],8]
   Pid: 0 Local rank: 2 Node rank: 2
   State: 0 App_context: 0 Slot list: NULL
 Data for proc: [[57989,1],11]
   Pid: 0 Local rank: 3 Node rank: 3
   State: 0 App_context: 0 Slot list: NULL
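
To make the mismatch in the map above easier to see: PE_HOSTFILE grants 6/3/3 slots, but the map puts 4 ranks on every node, so barney.fft and carl.fft each show "Slots in use: 4" against "Num slots: 3". The sketch below is only a simplified model of round-robin placement by node versus filling by slot. It is not Open MPI's actual mapper code, and the function names `map_bynode` and `map_byslot` are made up for illustration. It assumes oversubscription is allowed, as the map reports.

```python
from collections import Counter

def map_bynode(slots, nprocs):
    """Round-robin by node: rank r goes to host r mod (number of hosts).
    Simplified model that ignores slot limits (oversubscription allowed)."""
    hosts = list(slots)
    return {rank: hosts[rank % len(hosts)] for rank in range(nprocs)}

def map_byslot(slots, nprocs):
    """Fill by slot: use up all slots of one host before moving to the next."""
    order = [host for host, n in slots.items() for _ in range(n)]
    return {rank: order[rank % len(order)] for rank in range(nprocs)}

# Slot counts as reported by ras:gridengine from PE_HOSTFILE for job 1416.
slots = {"charlie": 6, "barney.fft": 3, "carl.fft": 3}

bynode = map_bynode(slots, 12)
byslot = map_byslot(slots, 12)

per_host_bynode = Counter(bynode.values())
per_host_byslot = Counter(byslot.values())

# Hosts that receive more ranks than granted slots, and by how many.
oversubscribed = {
    host: per_host_bynode[host] - n
    for host, n in slots.items()
    if per_host_bynode[host] > n
}

print("bynode:", dict(per_host_bynode))
print("byslot:", dict(per_host_byslot))
print("oversubscribed with bynode:", oversubscribed)
```

With -np 12 the by-node model reproduces the rank placement in the map (charlie gets 0, 3, 6, 9; barney.fft gets 1, 4, 7, 10; carl.fft gets 2, 5, 8, 11), which is one rank more than the granted slots on barney.fft and carl.fft. The by-slot model gives 6/3/3 and stays within the allocation.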