Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] sge tight intregration leads to bad allocation
From: Eloi Gaudry (eloi.gaudry_at_[hidden])
Date: 2012-04-06 06:17:58


> - Can you please post while it's running the relevant lines from:
> ps -e f --cols=500
> (f w/o -) from both machines.
> It's allocated between the nodes more like in a round-robin fashion.
> [eg: ] I'll try to do this tomorrow, as soon as some slots become free. Thanks for your feedback, Reuti, I appreciate it.

 
Hi Reuti, here is the information from another run that is failing in the same way:

 
qstat -g t:

------------
---------------------------------------------------------------------------------
smp4.q_at_barney.fft              BIP   0/3/4          3.37     lx-amd64
  hc:mem_available=1.715G
  hc:proc_available=1
   1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 SLAVE
                                                                  SLAVE
                                                                  SLAVE
---------------------------------------------------------------------------------
smp4.q_at_carl.fft                BIP   0/3/4          3.44     lx-amd64
  hc:mem_available=1.715G
  hc:proc_available=1
   1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 SLAVE
                                                                  SLAVE
                                                                  SLAVE
---------------------------------------------------------------------------------
smp8.q_at_charlie.fft             BIP   0/6/8          3.46     lx-amd64
  hc:mem_available=4.018G
  hc:proc_available=2
   1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 MASTER
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE

 
barney: ps -e f --cols=500:

-----------------------------------

 2048 ?        Sl     3:33 /opt/sge/bin/lx-amd64/sge_execd
27502 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
27503 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
27510 ?        S      0:00          \_ bash -c  PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27511 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27512 ?        Rl    12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
27513 ?        Rl    12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 
carl: ps -e f --cols=500:

-------------------------------

 1928 ?        Sl     3:10 /opt/sge/bin/lx-amd64/sge_execd
29022 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
29023 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/carl/active_jobs/1416.1/1.carl
29030 ?        S      0:00          \_ bash -c  PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29031 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29032 ?        Rl    13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29033 ?        Rl    13:50                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29034 ?        Rl    13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29035 ?        Rl    13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 
 
charlie: ps -e f --cols=500:

-----------------------------------

 1591 ?        Sl     3:13 /opt/sge/bin/lx-amd64/sge_execd
 8793 ?        S      0:00  \_ sge_shepherd-1416 -bg
 8795 ?        Ss     0:00      \_ -bash /opt/sge/default/spool/charlie/job_scripts/1416
 8800 ?        S      0:00          \_ /opt/openmpi-1.4.4/bin/orterun --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1 --bynode -report-bindings -display-map -display-devel-map -display-allocation -display-devel-allocation -np 12 -x ACTRAN_LICENSE -x ACTRAN_PRODUCTLINE -x LD_LIBRARY_PATH -x PATH -x ACTRAN_DEBUG /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parall
 8801 ?        Sl     0:00              \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V barney.fft  PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose
 8802 ?        Sl     0:00              \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V carl.fft  PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
 8807 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8808 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8809 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8810 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat

 
orterun information:

--------------------------

[charlie:08800] ras:gridengine: JOB_ID: 1416
[charlie:08800] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
[charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
[charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
[charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3
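
To compare the granted allocation with the process map, the `PE_HOSTFILE shows slots=` lines above can be reduced to a host-to-slots table. This is only a minimal sketch, assuming the verbose output keeps exactly the format printed above; `parse_ras_slots` is a made-up helper name, not part of Open MPI or SGE:

```python
import re

# Matches lines such as:
# [charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
_SLOT_LINE = re.compile(r"ras:gridengine: (\S+): PE_HOSTFILE shows slots=(\d+)")


def parse_ras_slots(text):
    """Return {host: slots} from ras:gridengine verbose output (hypothetical helper)."""
    slots = {}
    for line in text.splitlines():
        match = _SLOT_LINE.search(line)
        if match:
            slots[match.group(1)] = int(match.group(2))
    return slots


# The verbose lines reported by orterun for job 1416, copied from above.
log = """\
[charlie:08800] ras:gridengine: JOB_ID: 1416
[charlie:08800] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
[charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
[charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
[charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3
"""

slots = parse_ras_slots(log)
print(slots)                 # per-host slot counts
print(sum(slots.values()))   # total slots granted to the job
```

The total matches the `-np 12` passed to orterun, so the allocation itself has the expected size (6 + 3 + 3).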

======================   ALLOCATED NODES   ======================

 Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: [[57989,0],0] Daemon launched: True
  Num slots: 6  Slots in use: 0
  Num slots allocated: 6  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0
 Data for node: Name: barney.fft    Launch id: -1 Arch: 0 State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: Not defined Daemon launched: False
  Num slots: 3  Slots in use: 0
  Num slots allocated: 3  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0
 Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: Not defined Daemon launched: False
  Num slots: 3  Slots in use: 0
  Num slots allocated: 3  Max slots: 0
  Username on node: NULL
  Num procs: 0  Next node_rank: 0

=================================================================

 Map generated by mapping policy: 0200
  Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
  Num new daemons: 2  New daemon starting vpid 1
  Num nodes: 3

 Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: [[57989,0],0] Daemon launched: True
  Num slots: 6  Slots in use: 4
  Num slots allocated: 6  Max slots: 0
  Username on node: NULL
  Num procs: 4  Next node_rank: 4
  Data for proc: [[57989,1],0]
    Pid: 0  Local rank: 0 Node rank: 0
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],3]
    Pid: 0  Local rank: 1 Node rank: 1
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],6]
    Pid: 0  Local rank: 2 Node rank: 2
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],9]
    Pid: 0  Local rank: 3 Node rank: 3
    State: 0  App_context: 0  Slot list: NULL


 Data for node: Name: barney.fft    Launch id: -1 Arch: 0 State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: [[57989,0],1] Daemon launched: False
  Num slots: 3  Slots in use: 4
  Num slots allocated: 3  Max slots: 0
  Username on node: NULL
  Num procs: 4  Next node_rank: 4
  Data for proc: [[57989,1],1]
    Pid: 0  Local rank: 0 Node rank: 0
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],4]
    Pid: 0  Local rank: 1 Node rank: 1
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],7]
    Pid: 0  Local rank: 2 Node rank: 2
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],10]
    Pid: 0  Local rank: 3 Node rank: 3
    State: 0  App_context: 0  Slot list: NULL

 Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
  Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
  Daemon: [[57989,0],2] Daemon launched: False
  Num slots: 3  Slots in use: 4
  Num slots allocated: 3  Max slots: 0
  Username on node: NULL
  Num procs: 4  Next node_rank: 4
  Data for proc: [[57989,1],2]
    Pid: 0  Local rank: 0 Node rank: 0
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],5]
    Pid: 0  Local rank: 1 Node rank: 1
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],8]
    Pid: 0  Local rank: 2 Node rank: 2
    State: 0  App_context: 0  Slot list: NULL
  Data for proc: [[57989,1],11]
    Pid: 0  Local rank: 3 Node rank: 3
    State: 0  App_context: 0  Slot list: NULL
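
For reference, the rank placement in the map above is what a plain round-robin over the three nodes gives when slot counts are ignored. The sketch below is not Open MPI's mapper code; it is a naive illustration (the function name `map_bynode` is invented) that reproduces the ranks listed per node and compares the counts with the slots from the PE_HOSTFILE:

```python
def map_bynode(node_names, num_procs):
    """Naive by-node round-robin: rank i goes to node i % len(node_names).

    Assumption for illustration only: slot counts are ignored entirely.
    """
    placement = {name: [] for name in node_names}
    for rank in range(num_procs):
        placement[node_names[rank % len(node_names)]].append(rank)
    return placement


# Slots per node as shown in the ALLOCATED NODES section, in the same order.
slots = {"charlie": 6, "barney.fft": 3, "carl.fft": 3}

placement = map_bynode(list(slots), 12)
for node, ranks in placement.items():
    over = len(ranks) - slots[node]
    note = f"oversubscribed by {over}" if over > 0 else "within allocation"
    print(f"{node}: ranks {ranks} -> {len(ranks)} procs / {slots[node]} slots ({note})")
```

With these inputs, charlie gets ranks 0, 3, 6, 9 (4 of 6 slots), while barney.fft and carl.fft each get 4 ranks on 3 slots, which matches the "Slots in use: 4" against "Num slots allocated: 3" reported for both nodes.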