Subject: Re: [OMPI users] Strange behaviour of SGE+OpenMPI
From: PN (poknam_at_[hidden])
Date: 2009-04-01 12:31:15


Thanks. I've tried your suggestion.

$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun -mca ras_gridengine_verbose 100 -v -np $NSLOTS --host node0001,node0002 hostname

SGE allocated 2 nodes to the job; however, all the processes were spawned on node0001.
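
To double-check what SGE actually granted, the PE hostfile can be dumped from
inside the job script; $PE_HOSTFILE is a standard SGE variable for parallel
jobs, so this needs nothing beyond a stock install:

# Each line of the PE hostfile gives: hostname, slot count, queue, processor range
cat $PE_HOSTFILE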

$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
all.q_at_[hidden]              BIPC  0/4/4          4.79     lx24-amd64
     45 0.55500 HPL_8cpu_G admin        r     04/02/2009 00:26:49     4
---------------------------------------------------------------------------------
all.q_at_[hidden]              BIPC  0/4/4          0.00     lx24-amd64
     45 0.55500 HPL_8cpu_G admin        r     04/02/2009 00:26:49     4
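
Note that qstat -f only shows slot usage per queue instance. The task-level
view lists the MASTER and SLAVE tasks SGE assigned to each host, which makes
the intended placement explicit (a standard SGE option):

# -g t expands each job into its per-host MASTER/SLAVE tasks
qstat -g t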

$ cat HPL_8cpu_GB.o45
[node0001:03194] ras:gridengine: JOB_ID: 45
[node0001:03194] ras:gridengine: node0001.v5cluster.com: PE_HOSTFILE shows slots=4
[node0001:03194] ras:gridengine: node0002.v5cluster.com: PE_HOSTFILE shows slots=4
node0001
node0001
node0001
node0001
node0001
node0001
node0001
node0001
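
So Open MPI's gridengine module sees slots=4 on both nodes, yet all eight
processes land on node0001. Assuming this 1.3.1 build supports them (both
options are documented for the 1.3 series; check mpirun --help), the detected
allocation and the planned process map can be printed before launch to see
where the mapping diverges:

# Print the detected node/slot allocation and the planned process placement
/opt/openmpi-gcc/bin/mpirun -display-allocation -display-map -np $NSLOTS hostname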

$ qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:01:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list blcr
pe_list make mpi-rr mpi-fu orte
rerun FALSE
slots 4,[node0001=4],[node0002=4]
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY

$ qconf -se node0001
hostname node0001.v5cluster.com
load_scaling NONE
complex_values slots=4
load_values arch=lx24-amd64,num_proc=4,mem_total=3949.597656M, \
                      swap_total=0.000000M,virtual_total=3949.597656M, \
                      load_avg=2.800000,load_short=0.220000, \
                      load_medium=2.800000,load_long=2.320000, \
                      mem_free=3818.746094M,swap_free=0.000000M, \
                      virtual_free=3818.746094M,mem_used=130.851562M, \
                      swap_used=0.000000M,virtual_used=130.851562M, \
                      cpu=0.000000,np_load_avg=0.700000, \
                      np_load_short=0.055000,np_load_medium=0.700000, \
                      np_load_long=0.580000
processors 4
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE

$ qconf -se node0002
hostname node0002.v5cluster.com
load_scaling NONE
complex_values slots=4
load_values arch=lx24-amd64,num_proc=4,mem_total=3949.597656M, \
                      swap_total=0.000000M,virtual_total=3949.597656M, \
                      load_avg=0.000000,load_short=0.000000, \
                      load_medium=0.000000,load_long=0.000000, \
                      mem_free=3843.074219M,swap_free=0.000000M, \
                      virtual_free=3843.074219M,mem_used=106.523438M, \
                      swap_used=0.000000M,virtual_used=106.523438M, \
                      cpu=0.000000,np_load_avg=0.000000, \
                      np_load_short=0.000000,np_load_medium=0.000000, \
                      np_load_long=0.000000
processors 4
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
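
For completeness, since Open MPI reads the allocation directly from SGE, the
--host filter can also be dropped entirely. A sketch using -bynode (documented
for the 1.3 series) to round-robin processes across the allocated nodes rather
than filling node0001 first:

# No --host filter: rely purely on the SGE allocation, mapping round-robin by node
/opt/openmpi-gcc/bin/mpirun -np $NSLOTS -bynode hostname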

2009/4/1 Rolf Vandevaart <Rolf.Vandevaart_at_[hidden]>

> It turns out that the use of --host and --hostfile acts as a filter for which
> nodes to run on when you are running under SGE. So, listing them several
> times does not affect where the processes land. However, this still does
> not explain why you are seeing what you are seeing. One thing you can try
> is to add this to the mpirun command.
>
> -mca ras_gridengine_verbose 100
>
> This will provide some additional information as to what Open MPI is seeing
> as nodes and slots from SGE. (Is there any chance that node0002 actually
> has 8 slots?)
>
> I just retried on my cluster of 2-CPU SPARC Solaris nodes. When I run with
> np=2, the two MPI processes will both land on a single node, because that
> node has two slots. When I go up to np=4, then they move on to the other
> node. The --host acts as a filter to where they should run.
>
> In terms of using "IB bonding", I do not know what that means exactly.
> Open MPI does stripe over multiple IB interfaces, so I think the answer is
> yes.
>
> Rolf
>
> PS: Here is what my np=4 job script looked like. (I just changed np=2 for
> the other run)
>
> burl-ct-280r-0 148 =>more run.sh
> #! /bin/bash
> #$ -S /bin/bash
> #$ -V
> #$ -cwd
> #$ -N Job1
> #$ -pe orte 200
> #$ -j y
> #$ -l h_rt=00:20:00 # Run time (hh:mm:ss) - 20 min
>
> echo $NSLOTS
> /opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v -np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname
>
> Here is the output (somewhat truncated)
> burl-ct-280r-0 150 =>more Job1.o199
> 200
> [burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
> [burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE: /ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile
> [..snip..]
> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows slots=2
> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows slots=2
> [..snip..]
> burl-ct-280r-1
> burl-ct-280r-1
> burl-ct-280r-0
> burl-ct-280r-0
> burl-ct-280r-0 151 =>
>
>
>
> On 03/31/09 22:39, PN wrote:
>
>> Dear Rolf,
>>
>> Thanks for your reply.
>> I've created another PE and changed the submission script, explicitly
>> specifying the hostnames with "--host".
>> However the result is the same.
>>
>> # qconf -sp orte
>> pe_name orte
>> slots 8
>> user_lists NONE
>> xuser_lists NONE
>> start_proc_args /bin/true
>> stop_proc_args /bin/true
>> allocation_rule $fill_up
>> control_slaves TRUE
>> job_is_first_task FALSE
>> urgency_slots min
>> accounting_summary TRUE
>>
>> $ cat hpl-8cpu-test.sge
>> #!/bin/bash
>> #
>> #$ -N HPL_8cpu_GB
>> #$ -pe orte 8
>> #$ -cwd
>> #$ -j y
>> #$ -S /bin/bash
>> #$ -V
>> #
>> cd /home/admin/hpl-2.0
>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>
>>
>> # pdsh -a ps ax --width=200|grep hpl
>> node0002: 18901 ? S 0:00 /opt/openmpi-gcc/bin/mpirun -v -np 8 --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18902 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18903 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18904 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18905 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18906 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18907 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18908 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18909 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>
>> Any hint to debug this situation?
>>
>> Also, if each node has 2 IB ports on which IB bonding was done, will
>> Open MPI automatically benefit from the doubled bandwidth?
>>
>> Thanks a lot.
>>
>> Best Regards,
>> PN
>>
>> 2009/4/1 Rolf Vandevaart <Rolf.Vandevaart_at_[hidden]>
>>
>>
>> On 03/31/09 11:43, PN wrote:
>>
>> Dear all,
>>
>> I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2.
>> I have 2 compute nodes for testing, each node has a single quad
>> core CPU.
>>
>> Here is my submission script and PE config:
>> $ cat hpl-8cpu.sge
>> #!/bin/bash
>> #
>> #$ -N HPL_8cpu_IB
>> #$ -pe mpi-fu 8
>> #$ -cwd
>> #$ -j y
>> #$ -S /bin/bash
>> #$ -V
>> #
>> cd /home/admin/hpl-2.0
>> # For IB
>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl
>>
>> I've verified that the mpirun command runs correctly from the
>> command line.
>>
>> $ qconf -sp mpi-fu
>> pe_name mpi-fu
>> slots 8
>> user_lists NONE
>> xuser_lists NONE
>> start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>> stop_proc_args /opt/sge/mpi/stopmpi.sh
>> allocation_rule $fill_up
>> control_slaves TRUE
>> job_is_first_task FALSE
>> urgency_slots min
>> accounting_summary TRUE
>>
>>
>> I've checked $TMPDIR/machines after submitting; it was correct.
>> node0002
>> node0002
>> node0002
>> node0002
>> node0001
>> node0001
>> node0001
>> node0001
>>
>> However, I found that if I explicitly specify the "-machinefile
>> $TMPDIR/machines", all 8 mpi processes were spawned within a
>> single node, i.e. node0002.
>>
>> However, if I omit "-machinefile $TMPDIR/machines" on the mpirun
>> line, i.e.
>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS ./bin/goto-openmpi-gcc/xhpl
>>
>> the MPI processes start correctly, with 4 processes on node0001
>> and 4 processes on node0002.
>>
>> Is this normal behaviour of Open MPI?
>>
>>
>> I just tried it both ways and I got the same result both times. The
>> processes are split between the nodes. Perhaps to be extra sure,
>> you can just run hostname? And for what it is worth, as you have
>> seen, you do not need to specify a machines file. Open MPI will use
>> the ones that were allocated by SGE. You can also change your
>> parallel queue to not run any scripts. Like this:
>>
>> start_proc_args /bin/true
>> stop_proc_args /bin/true
>>
>>
>>
>> Also, I wondered: if I have an IB interface, for example, with the
>> IB hostnames being node0001-clust and node0002-clust, will
>> Open MPI automatically use the IB interface?
>>
>> Yes, it should use the IB interface.
>>
>>
>> And if each node has 2 IB ports on which IB bonding was done, will
>> Open MPI automatically benefit from the doubled
>> bandwidth?
>>
>> Thanks a lot.
>>
>> Best Regards,
>> PN
>>
>> --
>> =========================
>> rolf.vandevaart_at_[hidden]
>> 781-442-3043
>> =========================
>
>
> --
>
> =========================
> rolf.vandevaart_at_[hidden]
> 781-442-3043
> =========================