Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] sge tight intregration leads to bad allocation
From: Reuti (reuti_at_[hidden])
Date: 2012-04-09 19:05:57


Am 06.04.2012 um 12:17 schrieb Eloi Gaudry:

> > - Can you please post while it's running the relevant lines from:
> > ps -e f --cols=500
> > (f w/o -) from both machines.
> > It's allocated between the nodes more in a round-robin fashion.
> > [eg: ] I'll try to do this tomorrow, as soon as some slots become free. Thanks for your feedback, Reuti, I appreciate it.
>
> hi reuti, here is the information related to another run that is failing in the same way:
>
> qstat -g t:
> ------------
> ---------------------------------------------------------------------------------
> smp4.q_at_barney.fft BIP 0/3/4 3.37 lx-amd64
> hc:mem_available=1.715G
> hc:proc_available=1
> 1416 0.60500 semi_green jj r 04/06/2012 11:57:34 SLAVE
> SLAVE
> SLAVE
> ---------------------------------------------------------------------------------
> smp4.q_at_carl.fft BIP 0/3/4 3.44 lx-amd64
> hc:mem_available=1.715G
> hc:proc_available=1
> 1416 0.60500 semi_green jj r 04/06/2012 11:57:34 SLAVE
> SLAVE
> SLAVE
> ---------------------------------------------------------------------------------
> smp8.q_at_charlie.fft BIP 0/6/8 3.46 lx-amd64
> hc:mem_available=4.018G
> hc:proc_available=2
> 1416 0.60500 semi_green jj r 04/06/2012 11:57:34 MASTER
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE

Thx. This is the allocation which is also confirmed by the Open MPI output.

- Was the application compiled with the same version of Open MPI?
- Does the application start something on its own besides the tasks granted by mpiexec/orterun?

You want 12 ranks in total, and the orted started via qrsh_starter on barney.fft and carl.fft is also given "-mca orte_ess_num_procs 3". In this example I count only 10 ranks in total (4+4+2) - do you observe the same?
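
For reference, counting the solver ranks on one node directly from the same ps output is a quick cross-check (a minimal sketch, assuming the ranks are exactly the processes named "actranpy_mp" as in your listings; the [a] keeps grep from counting its own command line):

ps -e f --cols=500 | grep -c "[a]ctranpy_mp"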

It looks like Open MPI is doing the right thing, but the application decided to start with a different allocation.

Does the application additionally use OpenMP or other kinds of threads? The suffix "_mp" in the name "actranpy_mp" makes me suspect it does.
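
If you want to check this while the job is running, the -L option of ps lists the threads of each process; something like the following (again assuming the "actranpy_mp" name) would show how many threads the ranks on a node have spawned in total:

ps -eLf | grep -c "[a]ctranpy_mp"

If OpenMP turns out to be involved, exporting OMP_NUM_THREADS=1 in the job script before the orterun call would be one way to rule it out - whether the solver honors this variable is an assumption on my side.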

-- Reuti

> barney: ps -e f --cols=500:
> -----------------------------------
> 2048 ? Sl 3:33 /opt/sge/bin/lx-amd64/sge_execd
> 27502 ? Sl 0:00 \_ sge_shepherd-1416 -bg
> 27503 ? Ss 0:00 \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
> 27510 ? S 0:00 \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess e
> nv -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 27511 ? S 0:00 \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 27512 ? Rl 12:54 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 27513 ? Rl 12:54 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>
> carl: ps -e f --cols=500:
> -------------------------------
> 1928 ? Sl 3:10 /opt/sge/bin/lx-amd64/sge_execd
> 29022 ? Sl 0:00 \_ sge_shepherd-1416 -bg
> 29023 ? Ss 0:00 \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/carl/active_jobs/1416.1/1.carl
> 29030 ? S 0:00 \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess e
> nv -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 29031 ? S 0:00 \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 29032 ? Rl 13:49 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29033 ? Rl 13:50 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29034 ? Rl 13:49 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29035 ? Rl 13:49 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>
>
> charlie: ps -e f --cols=500:
> -----------------------------------
> 1591 ? Sl 3:13 /opt/sge/bin/lx-amd64/sge_execd
> 8793 ? S 0:00 \_ sge_shepherd-1416 -bg
> 8795 ? Ss 0:00 \_ -bash /opt/sge/default/spool/charlie/job_scripts/1416
> 8800 ? S 0:00 \_ /opt/openmpi-1.4.4/bin/orterun --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1 --bynode -report-bindings -display-map -display-devel-map -display-allocation -display-devel-allocation -np 12 -x ACTRAN_LICENSE -x ACTRAN_PRODUCTLINE -x LD_LIBRARY_PATH -x PATH -x ACTRAN_DEBUG /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parall
> 8801 ? Sl 0:00 \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V barney.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose
> 8802 ? Sl 0:00 \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V carl.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 8807 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 8808 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 8809 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 8810 ? Rl 14:23 \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>
> orterun information:
> --------------------------
> [charlie:08800] ras:gridengine: JOB_ID: 1416
> [charlie:08800] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
> [charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
> [charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
> [charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3
>
> ====================== ALLOCATED NODES ======================
>
> Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: [[57989,0],0] Daemon launched: True
> Num slots: 6 Slots in use: 0
> Num slots allocated: 6 Max slots: 0
> Username on node: NULL
> Num procs: 0 Next node_rank: 0
> Data for node: Name: barney.fft Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: Not defined Daemon launched: False
> Num slots: 3 Slots in use: 0
> Num slots allocated: 3 Max slots: 0
> Username on node: NULL
> Num procs: 0 Next node_rank: 0
> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: Not defined Daemon launched: False
> Num slots: 3 Slots in use: 0
> Num slots allocated: 3 Max slots: 0
> Username on node: NULL
> Num procs: 0 Next node_rank: 0
>
> =================================================================
>
> Map generated by mapping policy: 0200
> Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
> Num new daemons: 2 New daemon starting vpid 1
> Num nodes: 3
>
> Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: [[57989,0],0] Daemon launched: True
> Num slots: 6 Slots in use: 4
> Num slots allocated: 6 Max slots: 0
> Username on node: NULL
> Num procs: 4 Next node_rank: 4
> Data for proc: [[57989,1],0]
> Pid: 0 Local rank: 0 Node rank: 0
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],3]
> Pid: 0 Local rank: 1 Node rank: 1
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],6]
> Pid: 0 Local rank: 2 Node rank: 2
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],9]
> Pid: 0 Local rank: 3 Node rank: 3
> State: 0 App_context: 0 Slot list: NULL
>
> Data for node: Name: barney.fft Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: [[57989,0],1] Daemon launched: False
> Num slots: 3 Slots in use: 4
> Num slots allocated: 3 Max slots: 0
> Username on node: NULL
> Num procs: 4 Next node_rank: 4
> Data for proc: [[57989,1],1]
> Pid: 0 Local rank: 0 Node rank: 0
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],4]
> Pid: 0 Local rank: 1 Node rank: 1
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],7]
> Pid: 0 Local rank: 2 Node rank: 2
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],10]
> Pid: 0 Local rank: 3 Node rank: 3
> State: 0 App_context: 0 Slot list: NULL
>
> Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> Daemon: [[57989,0],2] Daemon launched: False
> Num slots: 3 Slots in use: 4
> Num slots allocated: 3 Max slots: 0
> Username on node: NULL
> Num procs: 4 Next node_rank: 4
> Data for proc: [[57989,1],2]
> Pid: 0 Local rank: 0 Node rank: 0
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],5]
> Pid: 0 Local rank: 1 Node rank: 1
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],8]
> Pid: 0 Local rank: 2 Node rank: 2
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[57989,1],11]
> Pid: 0 Local rank: 3 Node rank: 3
> State: 0 App_context: 0 Slot list: NULL
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users