Thanks. I've tried your suggestion.
$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun -mca ras_gridengine_verbose 100 -v -np $NSLOTS --host node0001,node0002 hostname
SGE allocated 2 nodes for the job, but all 8 processes were spawned on node0001.
$ qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node0001.v5cluster.com BIPC 0/4/4 4.79 lx24-amd64
45 0.55500 HPL_8cpu_G admin r 04/02/2009 00:26:49 4
---------------------------------------------------------------------------------
all.q@node0002.v5cluster.com BIPC 0/4/4 0.00 lx24-amd64
45 0.55500 HPL_8cpu_G admin r 04/02/2009 00:26:49 4
$ cat HPL_8cpu_GB.o45
[node0001:03194] ras:gridengine: JOB_ID: 45
[node0001:03194] ras:gridengine: node0001.v5cluster.com: PE_HOSTFILE shows slots=4
[node0001:03194] ras:gridengine: node0002.v5cluster.com: PE_HOSTFILE shows slots=4
node0001
node0001
node0001
node0001
node0001
node0001
node0001
node0001
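One way to make the distribution obvious at a glance is to tally the hostname lines per node. The following is only a sketch: tally_hosts is a hypothetical helper name, and it assumes the job output holds one hostname per line plus ras:gridengine debug lines that start with "[".

```shell
#!/bin/bash
# Hypothetical helper: count how many hostname lines each node produced.
# Debug lines (starting with '[') and empty lines are skipped.
tally_hosts() {
    grep -v -e '^\[' -e '^$' | sort | uniq -c | awk '{print $2, $1}'
}

# Possible usage with the output file shown above:
#   tally_hosts < HPL_8cpu_GB.o45
```

For the output above this would report a single line, "node0001 8", confirming that nothing ran on node0002.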
$ qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:01:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list blcr
pe_list make mpi-rr mpi-fu orte
rerun FALSE
slots 4,[node0001=4],[node0002=4]
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
$ qconf -se node0001
hostname node0001.v5cluster.com
load_scaling NONE
complex_values slots=4
load_values arch=lx24-amd64,num_proc=4,mem_total=3949.597656M, \
swap_total=0.000000M,virtual_total=3949.597656M, \
load_avg=2.800000,load_short=0.220000, \
load_medium=2.800000,load_long=2.320000, \
mem_free=3818.746094M,swap_free=0.000000M, \
virtual_free=3818.746094M,mem_used=130.851562M, \
swap_used=0.000000M,virtual_used=130.851562M, \
cpu=0.000000,np_load_avg=0.700000, \
np_load_short=0.055000,np_load_medium=0.700000, \
np_load_long=0.580000
processors 4
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
$ qconf -se node0002
hostname node0002.v5cluster.com
load_scaling NONE
complex_values slots=4
load_values arch=lx24-amd64,num_proc=4,mem_total=3949.597656M, \
swap_total=0.000000M,virtual_total=3949.597656M, \
load_avg=0.000000,load_short=0.000000, \
load_medium=0.000000,load_long=0.000000, \
mem_free=3843.074219M,swap_free=0.000000M, \
virtual_free=3843.074219M,mem_used=106.523438M, \
swap_used=0.000000M,virtual_used=106.523438M, \
cpu=0.000000,np_load_avg=0.000000, \
np_load_short=0.000000,np_load_medium=0.000000, \
np_load_long=0.000000
processors 4
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
It turns out that --host and --hostfile act as filters on which nodes to run on when you are running under SGE. So, listing the nodes several times does not affect where the processes land. However, this still does not explain what you are seeing. One thing you can try is to add this to the mpirun command:
-mca ras_gridengine_verbose 100
This will provide some additional information as to what Open MPI is seeing as nodes and slots from SGE. (Is there any chance that node0002 actually has 8 slots?)
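As a complement to the verbose flag, the allocation can also be printed directly from inside the job script. This is a sketch under assumptions: summarize_pe_hostfile is a hypothetical helper, and it assumes the usual PE hostfile layout of host name in the first column and slot count in the second.

```shell
#!/bin/bash
# Hypothetical helper: print per-host slots and the total from an SGE PE hostfile.
# Assumed columns: host, slots, queue, processor range (only the first two are used).
summarize_pe_hostfile() {
    awk '{ printf "%s slots=%d\n", $1, $2; total += $2 }
         END { printf "total=%d\n", total }' "$1"
}

# Possible usage inside a job script:
#   summarize_pe_hostfile "$PE_HOSTFILE"
#   echo "NSLOTS=$NSLOTS"
```

If the total printed here differs from $NSLOTS, or the per-host counts differ from what ras_gridengine_verbose reports, that narrows down where the mismatch is introduced.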
I just retried on my cluster of 2-CPU SPARC Solaris nodes. When I run with np=2, both MPI processes land on a single node, because that node has two slots. When I go up to np=4, the extra processes move on to the other node. The --host option acts as a filter on where they may run.
As for "IB bonding", I do not know exactly what that means. Open MPI does stripe over multiple IB interfaces, so I think the answer is yes.
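The fill behaviour described here can be illustrated with a toy model. This is not Open MPI's actual mapper, only a sketch of "fill one node's slots before moving to the next"; place_by_slot and its host:slots argument format are made up for the illustration.

```shell
#!/bin/bash
# Toy model of by-slot placement (hypothetical helper, not Open MPI code).
# Usage: place_by_slot NP host1:slots1 host2:slots2 ...
# The function body runs in a subshell so its variables do not leak.
place_by_slot() (
    np=$1; shift
    rank=0
    for pair in "$@"; do
        host=${pair%%:*}     # text before the colon
        slots=${pair##*:}    # text after the colon
        i=0
        # Use up this host's slots, stopping once all ranks are placed.
        while [ "$i" -lt "$slots" ] && [ "$rank" -lt "$np" ]; do
            echo "rank $rank -> $host"
            i=$((i + 1))
            rank=$((rank + 1))
        done
    done
)

# Examples:
#   place_by_slot 2 burl-ct-280r-1:2 burl-ct-280r-0:2
#   place_by_slot 4 burl-ct-280r-1:2 burl-ct-280r-0:2
```

With np=2 both ranks stay on the first host; with np=4 the last two spill onto the second host. Under this model, 8 processes on two 4-slot nodes should be split 4 and 4, which is why the all-on-one-node result is unexpected.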
Rolf
PS: Here is what my np=4 job script looked like. (I just changed it to np=2 for the other run.)
burl-ct-280r-0 148 =>more run.sh
#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 200
#$ -j y
#$ -l h_rt=00:20:00 # Run time (hh:mm:ss) - 20 min
echo $NSLOTS
/opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v -np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname
Here is the output (somewhat truncated)
burl-ct-280r-0 150 =>more Job1.o199
200
[burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
[burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE: /ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile
[..snip..]
[burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows slots=2
[burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows slots=2
[..snip..]
burl-ct-280r-1
burl-ct-280r-1
burl-ct-280r-0
burl-ct-280r-0
burl-ct-280r-0 151 =>
On 03/31/09 22:39, PN wrote:
2009/4/1 Rolf Vandevaart <Rolf.Vandevaart@sun.com>
Dear Rolf,
Thanks for your reply.
I've created another PE and changed the submission script to specify the hostnames explicitly with "--host".
However, the result is the same.
# qconf -sp orte
pe_name orte
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
cd /home/admin/hpl-2.0
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
# pdsh -a ps ax --width=200|grep hpl
node0002: 18901 ? S 0:00 /opt/openmpi-gcc/bin/mpirun -v -np 8 --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
node0002: 18902 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18903 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18904 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18905 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18906 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18907 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18908 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18909 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
Any hints on how to debug this situation?
Also, if I have 2 IB ports in each node with IB bonding configured, will Open MPI automatically benefit from the doubled bandwidth?
Thanks a lot.
Best Regards,
PN
On 03/31/09 11:43, PN wrote:
Dear all,
I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2.
I have 2 compute nodes for testing; each node has a single quad-core CPU.
Here is my submission script and PE config:
$ cat hpl-8cpu.sge
#!/bin/bash
#
#$ -N HPL_8cpu_IB
#$ -pe mpi-fu 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
cd /home/admin/hpl-2.0
# For IB
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl
I've verified that the mpirun command runs correctly from the command line.
$ qconf -sp mpi-fu
pe_name mpi-fu
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/sge/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
I checked $TMPDIR/machines after submitting the job, and it was correct:
node0002
node0002
node0002
node0002
node0001
node0001
node0001
node0001
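For reference, a machines file of this shape (one line per slot) can be derived from the PE hostfile. The sketch below is not the actual startmpi.sh code; expand_pe_hostfile is a hypothetical helper that assumes the host name is in the first column and the slot count in the second, and it strips the domain part so the names match the listing above.

```shell
#!/bin/bash
# Hypothetical helper: expand an SGE PE hostfile into one short hostname per slot.
# Assumed input columns: host, slots, ... (remaining columns are ignored).
expand_pe_hostfile() {
    awk '{
        split($1, name, ".")                         # keep the part before the first dot
        for (i = 0; i < $2; i++) print name[1]       # one line per allocated slot
    }' "$1"
}

# Possible usage inside a job script:
#   expand_pe_hostfile "$PE_HOSTFILE" > "$TMPDIR/machines"
```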
However, I found that if I explicitly specify "-machinefile $TMPDIR/machines", all 8 MPI processes are spawned on a single node, i.e. node0002.
If I omit "-machinefile $TMPDIR/machines" from the mpirun line, i.e.
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS ./bin/goto-openmpi-gcc/xhpl
the MPI processes start correctly, with 4 processes on node0001 and 4 processes on node0002.
Is this normal behaviour for Open MPI?
I just tried it both ways and I got the same result both times: the processes are split between the nodes. Perhaps, to be extra sure, you can just run hostname? And for what it is worth, as you have seen, you do not need to specify a machines file. Open MPI will use the nodes that were allocated by SGE. You can also change your parallel environment so that it does not run any scripts, like this:
start_proc_args /bin/true
stop_proc_args /bin/true
Also, if I have an IB interface, with IB hostnames such as node0001-clust and node0002-clust, will Open MPI automatically use the IB interface?
Yes, it should use the IB interface.
And if I have 2 IB ports in each node with IB bonding configured, will Open MPI automatically benefit from the doubled bandwidth?
Thanks a lot.
Best Regards,
PN
--
=========================
rolf.vandevaart@sun.com
781-442-3043
=========================
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users