FWIW: I verified that this works fine under a Slurm allocation of 2 nodes, each with 12 slots. I filled the node without getting an "oversubscribed" error message:

[rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname
[bend001:24318] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../..][../../../../../..]
[bend001:24318] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../../BB/BB][BB/BB/../../../..]
[bend001:24318] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][../../BB/BB/BB/BB]
bend001
bend001
bend001

where

[rhc@bend001 svn-trunk]$ cat hosts
bend001 slots=12

The only way I get the "out of resources" error is if I ask for more processes than I have slots - i.e., I give it the hosts file as shown, but ask for 13 or more processes.


BTW: note one important issue with cpus-per-proc, as shown above. Because I specified 4 cpus/proc, and my sockets each have 6 cpus, one of my procs wound up being split across the two sockets (2 cores on each). That's about the worst situation you can have.

So a word of caution: it is up to the user to ensure that the mapping is "good". We just do what you asked us to do.
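As a hedged illustration (using only the options already shown above, not re-verified here): on this 2 x 6-core layout, picking a cpus-per-proc value that divides the per-socket core count should keep every proc on a single socket, e.g.

mpirun -n 4 --bind-to core --cpus-per-proc 3 --report-bindings -hostfile hosts hostname

With 3 cores per proc, ranks 0-1 should land on socket 0 and ranks 2-3 on socket 1 (or use --cpus-per-proc 6 with -n 2), so no rank straddles the socket boundary.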


On Nov 13, 2013, at 8:30 PM, Ralph Castain <rhc@open-mpi.org> wrote:

Guess I don't see why modifying the allocation is required - we have mapping options that should support such things. If you specify the total number of procs you want, and cpus-per-proc=4, it should do the same thing, I would think. You'd get 2 procs on the 8-slot nodes, 8 on the 32-slot nodes, and up to 6 on the 64-slot nodes (since you specified np=16). So I guess I don't understand the issue.

Regardless, if NPROCS=8 (and you verified that by printing it out, not just assuming wc -l got that value), then it shouldn't think it is oversubscribed. I'll take a look under a slurm allocation as that is all I can access.
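As a quick sanity check (a sketch, assuming the pbs_hosts file from your script), something like:

echo "NPROCS=${NPROCS}"
cat -A pbs_hosts

would show the actual count and expose any blank lines, comments, or stray characters that wc -l might be counting.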


On Nov 13, 2013, at 7:23 PM, tmishima@jcity.maeda.co.jp wrote:



Our cluster consists of three types of nodes, which have 8, 32, and 64 slots respectively. Since the performance of each core is almost the same, mixed use of these nodes is possible.

Furthermore, in this case, for a hybrid application with openmpi+openmp, modification of the hostfile is necessary, as follows:

#PBS -l nodes=1:ppn=32+4:ppn=8
export OMP_NUM_THREADS=4
modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog

That's why I want to do that.
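For reference, the condensation step could be done with something like this (a sketch, assuming OMP_NUM_THREADS=4 and that $PBS_NODEFILE lists each node once per slot with repeated names grouped together):

awk 'NR % 4 == 1' $PBS_NODEFILE > pbs_hosts   # keep every 4th line: 64 lines -> 16 lines

so each remaining line corresponds to one MPI process that will run 4 OpenMP threads.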

Of course I know that if I give up mixed use, -npernode is better for this purpose.
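For example, on uniform 32-slot nodes that might look like the following (a sketch only, assuming 4 threads per rank and that the two options combine as expected):

mpirun -npernode 8 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog

which would place 8 ranks per node with 4 cores each, without touching the hostfile.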

(The script I showed you first is just a simplified one to clarify the
problem.)

tmishima


Why do it the hard way? I'll look at the FAQ because that definitely
isn't a recommended thing to do - better to use -host to specify the
subset, or just specify the desired mapping using all the
various mappers we provide.

On Nov 13, 2013, at 6:39 PM, tmishima@jcity.maeda.co.jp wrote:



Sorry for cross-post.

The nodefile is very simple; it consists of 8 lines:

node08
node08
node08
node08
node08
node08
node08
node08

Therefore, NPROCS=8

My aim is to modify the allocation, as you pointed out. According to the Open MPI FAQ, using a proper subset of the hosts allocated to the Torque / PBS Pro job should be allowed.

tmishima

Please - can you answer my question on script2? What is the value of NPROCS?

Why would you want to do it this way? Are you planning to modify the
allocation?? That generally is a bad idea as it can confuse the system


On Nov 13, 2013, at 5:55 PM, tmishima@jcity.maeda.co.jp wrote:



Since what I really want is to run script2 correctly, please let us concentrate on script2.

I'm not an expert on the internals of Open MPI; what I can do is just observe from the outside. These lines look strange to me, especially the last one.

[node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
[node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
[node08.cluster:26952] [[56581,0],0] Filtering thru apps
[node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
[node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0

These lines come from this part of orte_rmaps_base_get_target_nodes
in rmaps_base_support_fns.c:

    } else if (node->slots <= node->slots_inuse &&
               (ORTE_MAPPING_NO_OVERSUBSCRIBE &
                ORTE_GET_MAPPING_DIRECTIVE(policy))) {
        /* remove the node as fully used */
        OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
                             "%s Removing node %s slots %d inuse %d",
                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                             node->name, node->slots, node->slots_inuse));
        opal_list_remove_item(allocated_nodes, item);
        OBJ_RELEASE(item);  /* "un-retain" it */

I wonder why node->slots and node->slots_inuse are both 0, which I can read from the line above ("Removing node node08 slots 0 inuse 0"). With both values at 0, the test node->slots <= node->slots_inuse is true, so the node gets removed as "fully used" even though nothing has been mapped onto it yet.

Or, I'm not sure, but should "else if (node->slots <= node->slots_inuse &&" be "else if (node->slots < node->slots_inuse &&"?

tmishima

On Nov 13, 2013, at 4:43 PM, tmishima@jcity.maeda.co.jp wrote:



Yes, node08 has 8 slots, but the number of processes I run is also 8.

#PBS -l nodes=node08:ppn=8

Therefore, I think it should allow this allocation. Is that right?

Correct


My question is why script1 works and script2 does not. They are almost the same.

#PBS -l nodes=node08:ppn=8
export OMP_NUM_THREADS=1
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE pbs_hosts
NPROCS=`wc -l < pbs_hosts`

#SCRIPT1
mpirun -report-bindings -bind-to core Myprog

#SCRIPT2
mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core

This version is not only reading the PBS allocation, but also invoking the hostfile filter on top of it. Different code path. I'll take a look - it should still match up assuming NPROCS=8. Any possibility that it is a different number? I don't recall, but aren't there some extra lines in the nodefile - e.g., comments?


Myprog

tmishima

I guess here's my confusion. If you are using only one node, and that node has 8 allocated slots, then we will not allow you to run more than 8 processes on that node unless you specifically provide the --oversubscribe flag. This is because you are operating in a managed environment (in this case, under Torque), and so we treat the allocation as "mandatory" by default.

I suspect that is the issue here, in which case the system is behaving as it should.

Is the above accurate?


On Nov 13, 2013, at 4:11 PM, Ralph Castain <rhc@open-mpi.org> wrote:

It has nothing to do with LAMA as you aren't using that mapper.

How many nodes are in this allocation?

On Nov 13, 2013, at 4:06 PM, tmishima@jcity.maeda.co.jp wrote:



Hi Ralph, this is an additional information.

Here is the main part of the output obtained by adding "-mca rmaps_base_verbose 50".

[node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
[node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
[node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
[node08.cluster:26952] mca:rmaps: mapping job [56581,1]
[node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
[node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
[node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
[node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
[node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
[node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
[node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
[node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
[node08.cluster:26952] [[56581,0],0] Filtering thru apps
[node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
[node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0

From this result, I guess it's related to oversubscription. So I added "-oversubscribe" and reran; then it worked well, as shown below:

[node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
[node08.cluster:27019] [[56774,0],0] Filtering thru apps
[node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
[node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
[node08.cluster:27019]     node: node08 daemon: 0
[node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
[node08.cluster:27019] [[56774,0],0] Starting at node node08
[node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
[node08.cluster:27019] mca:rmaps:rr:slot working node node08
[node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
[node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
[node08.cluster:27019] mca:rmaps:rr:slot working node node08
[node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
[node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
[node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08

I think something is wrong with the treatment of oversubscription, which might be related to "#3893: LAMA mapper has problems".

tmishima

Hmmm...looks like we aren't getting your allocation. Can you rerun and add -mca ras_base_verbose 50?

On Nov 12, 2013, at 11:30 PM, tmishima@jcity.maeda.co.jp wrote:



Hi Ralph,

Here is the output of "-mca plm_base_verbose 5".

[node08.cluster:23573] mca:base:select:(  plm) Querying component [rsh]
[node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
[node08.cluster:23573] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[node08.cluster:23573] mca:base:select:(  plm) Querying component [slurm]
[node08.cluster:23573] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[node08.cluster:23573] mca:base:select:(  plm) Querying component [tm]
[node08.cluster:23573] mca:base:select:(  plm) Query of component [tm] set priority to 75
[node08.cluster:23573] mca:base:select:(  plm) Selected component [tm]
[node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
[node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
[node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
[node08.cluster:23573] [[59480,0],0] plm:base:setup_job
[node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
[node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
[node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

Here, openmpi's configuration is as follows:

./configure \
--prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
--with-tm \
--with-verbs \
--disable-ipv6 \
--disable-vt \
--enable-debug \
CC=pgcc CFLAGS="-tp k8-64e" \
CXX=pgCC CXXFLAGS="-tp k8-64e" \
F77=pgfortran FFLAGS="-tp k8-64e" \
FC=pgfortran FCFLAGS="-tp k8-64e"

Hi Ralph,

Okay, I can help you. Please give me some time to report the output.
output.

Tetsuya Mishima

I can try, but I have no way of testing Torque any more - so all I can do is a code review. If you can build --enable-debug and add -mca plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.


On Nov 12, 2013, at 9:58 PM, tmishima@jcity.maeda.co.jp wrote:



Hi Ralph,

Thank you for your quick response.

I'd like to report one more regression in the Torque support of openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper has problems" that I reported a few days ago.

The script below does not work with openmpi-1.7.4a1r29646,
although it worked with openmpi-1.7.3 as I told you before.

#!/bin/sh
#PBS -l nodes=node08:ppn=8
export OMP_NUM_THREADS=1
cd $PBS_O_WORKDIR
cp $PBS_NODEFILE pbs_hosts
NPROCS=`wc -l < pbs_hosts`
mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog

If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine. Since this happens without a lama request, I guess it's not a problem in lama itself. Anyway, please look into this issue as well.

Regards,
Tetsuya Mishima

Done - thanks!

On Nov 12, 2013, at 7:35 PM, tmishima@jcity.maeda.co.jp wrote:



Dear openmpi developers,

I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built with PGI 13.10, as shown below:

[mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
[manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
[manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
[manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
[manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
[manage:23082] *** Process received signal ***
[manage:23082] Signal: Segmentation fault (11)
[manage:23082] Signal code: Address not mapped (1)
[manage:23082] Failing at address: 0x34
[manage:23082] *** End of error message ***
Segmentation fault (core dumped)

[mishima@manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
Copyright (C) 2009 Free Software Foundation, Inc.
...
Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
(gdb) where
#0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
#1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
#2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
#3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
#4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
#5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
#6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
(gdb) quit


Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary, and it is what causes the segfault.

624      /* lookup the corresponding process */
625      peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
626      if (NULL == peer) {
627          ui64 = (uint64_t*)(&peer->name);
628          opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
629                              "%s mca_oob_tcp_recv_connect: connection from new peer",
630                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
631          peer = OBJ_NEW(mca_oob_tcp_peer_t);
632          peer->mod = mod;
633          peer->name = hdr->origin;
634          peer->state = MCA_OOB_TCP_ACCEPTING;
635          ui64 = (uint64_t*)(&peer->name);
636          if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
637              OBJ_RELEASE(peer);
638              return;
639          }


Please fix this mistake in the next release.

Regards,
Tetsuya Mishima
