Your patch looks fine to me, so I'll apply it. As for this second issue - good catch. Yes, if the binding directive was provided in the default MCA param file, then the procs would attempt to bind themselves on startup. The best fix is actually to just tell them not to do so. We already have that mechanism for when we actually bind them - what we should be doing instead is using that flag to indicate that any specified binding has already been applied.

I will follow that path. Thanks for pointing it out!
Ralph

On Mar 3, 2014, at 12:55 AM, tmishima@jcity.maeda.co.jp wrote:



Hi Ralph, I misunderstood the point of the problem.

The problem is that BIND_TO_OBJ is re-tried and applied in
orte_ess_base_proc_binding @ ess_base_fns.c, even though you switch to
BIND_TO_NONE in rmaps_rr_mappers.c when it's oversubscribed.
Furthermore, the binding in orte_ess_base_proc_binding does not support
cpus_per_rank. So when BIND_TO_CORE is specified and it's
oversubscribed with pe=N, the final binding we get is broken.

If you really want to BIND_TO_NONE, you should delete the binding part of
orte_ess_base_proc_binding. Or, if it's used for other purposes and
impossible to delete, it's better that you instead delete the
OPAL_SET_BINDING_POLICY(OPAL_BIND_TO_NONE) calls in the rr_mappers and
just leave the warning message.


Tetsuya

Hi Ralph, I have tested your fix (r30895). I'm afraid to say
I found a mistake.

You should include the "SETTING BIND_TO_NONE" part in the above if-clause
at lines 74, 256, 511, and 656. Otherwise, only the warning message
disappears, but binding to core is still overwritten by binding
to none. Please see the attached patch.

(See attached file: patch_from_30895)

Tetsuya


Hi Ralph, I understood what you meant.

I often use float in our application.
float c = (float)(unsigned int a - unsigned int b) can become
a very huge number if a < b. So I always carefully cast from
unsigned int to int when I subtract them. I didn't realize that
int d = (unsigned int a - unsigned int b) has no problem.
I noticed it from your suggestion, thanks.
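
For my own understanding, here is a minimal, self-contained C sketch of the
two cases (illustrative only, not taken from the Open MPI sources):

#include <stdio.h>

int main(void)
{
    unsigned int a = 2, b = 5;

    /* a - b wraps to a huge unsigned value (UINT_MAX - 2); converting
     * that to float keeps the huge value, roughly 4.29e9 */
    float c = (float)(a - b);

    /* converting the same wrapped value to int gives the expected -3
     * on the usual two's-complement platforms */
    int d = (int)(a - b);

    printf("c = %f, d = %d\n", c, d);
    return 0;
}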

Therefore, I think my fix is not necessary.

Tetsuya


Yes, indeed. In the future, when we have many, many cores
in a machine, we will have to take care of overflow of
num_procs.

Tetsuya

Cool - easily modified. Thanks!

Of course, you understand (I'm sure) that the cast does nothing to
protect the code from blowing up if we overrun the var. In other words,
if the unsigned var has wrapped, then casting it to int won't help -
you'll still get a negative integer, and the code will crash.
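
A minimal sketch of that point (again illustrative, not the Open MPI code):
once the unsigned variable has wrapped, the cast merely reinterprets the
wrapped bits:

#include <stdio.h>

int main(void)
{
    /* pretend an unsigned counter has already wrapped around */
    unsigned int nprocs = 0u - 2u;   /* i.e. UINT_MAX - 1 */

    /* the cast cannot recover the intended value; on typical
     * two's-complement platforms it just reads back as -2 */
    int n = (int)nprocs;

    printf("n = %d\n", n);
    return 0;
}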


On Feb 28, 2014, at 3:43 PM, tmishima@jcity.maeda.co.jp wrote:



Hi Ralph, I'm a little bit late for your release.

I found a minor mistake in byobj_span - an integer casting problem.

--- rmaps_rr_mappers.30892.c    2014-03-01 08:31:50 +0900
+++ rmaps_rr_mappers.c  2014-03-01 08:33:22 +0900
@@ -689,7 +689,7 @@
   }

   /* compute how many objs need an extra proc */
-    if (0 > (nxtra_objs = app->num_procs - (navg * nobjs))) {
+    if (0 > (nxtra_objs = (int)app->num_procs - (navg * (int)nobjs))) {
       nxtra_objs = 0;
   }

Tetsuya

Please take a look at
https://svn.open-mpi.org/trac/ompi/ticket/4317


On Feb 27, 2014, at 8:13 PM, tmishima@jcity.maeda.co.jp wrote:



Hi Ralph, I can't operate our cluster for a few days, sorry.

But now, I'm narrowing down the cause by browsing the source
code.

My best guess is line 529. opal_hwloc_base_get_obj_by_type will
reset the object pointer to the first one when you move on to the
next node.

529                    if (NULL == (obj = opal_hwloc_base_get_obj_by_type(node->topology, target, cache_level, i, OPAL_HWLOC_AVAILABLE))) {
530                        ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
531                        return ORTE_ERR_NOT_FOUND;
532                    }

If node->slots=1, then nprocs is set as nprocs=1 in the second pass:

495            nprocs = (node->slots - node->slots_inuse) / orte_rmaps_base.cpus_per_rank;
496            if (nprocs < 1) {
497                if (second_pass) {
498                    /* already checked for oversubscription permission, so at least put
499                     * one proc on it
500                     */
501                    nprocs = 1;

Therefore, opal_hwloc_base_get_obj_by_type is called one by one at
each node, which means the object we get is always the first one.

It's not elegant, but I guess you need dummy calls of
opal_hwloc_base_get_obj_by_type to move the object pointer to the
right place, or you need to modify opal_hwloc_base_get_obj_by_type itself.
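
To make that concrete, here is a toy, self-contained sketch (not the ORTE
code; the hwloc lookup is reduced to indexing an array) of why an index that
restarts at 0 for every pass always returns the first object, while a running
index walks through them:

#include <stdio.h>

/* stand-in for opal_hwloc_base_get_obj_by_type: return the i-th object
 * on a node, or NULL when i is out of range */
static const char *get_obj_by_index(const char **objs, int nobjs, int i)
{
    return (i < nobjs) ? objs[i] : NULL;
}

int main(void)
{
    const char *sockets[] = { "socket0", "socket1" };
    int pass, i;

    /* if only one proc is placed per pass and the index starts over
     * every time, each pass gets socket0 again */
    for (pass = 0; pass < 2; pass++) {
        i = 0;
        printf("restarting index: %s\n", get_obj_by_index(sockets, 2, i));
    }

    /* a running index carried across passes advances to socket1 */
    for (pass = 0, i = 0; pass < 2; pass++, i++) {
        printf("running index:    %s\n", get_obj_by_index(sockets, 2, i));
    }
    return 0;
}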

Tetsuya

I'm having trouble seeing why it is failing, so I added some more debug
output. Could you run the failure case again with -mca rmaps_base_verbose 10?

Thanks
Ralph

On Feb 27, 2014, at 6:11 PM, tmishima@jcity.maeda.co.jp wrote:



Just checking the difference; nothing of particular significance...

Anyway, I guess it's due to the behavior when the slot count is
missing (it is regarded as slots=1) and the node is oversubscribed
unintentionally.

I'm going out now, so I can't verify it quickly. If I provide the
correct slot counts, it will work, I guess. What do you think?

Tetsuya

"restore" in what sense?

On Feb 27, 2014, at 4:10 PM, tmishima@jcity.maeda.co.jp wrote:



Hi Ralph, this is just for your information.

I tried to restore the previous orte_rmaps_rr_byobj. Then I get the
result below with this command line:

mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog
Data for JOB [31184,1] offset 0

========================   JOB MAP   ========================

Data for node: node05  Num slots: 1    Max slots: 0    Num procs: 7
   Process OMPI jobid: [31184,1] App: 0 Process rank: 0
   Process OMPI jobid: [31184,1] App: 0 Process rank: 2
   Process OMPI jobid: [31184,1] App: 0 Process rank: 4
   Process OMPI jobid: [31184,1] App: 0 Process rank: 6
   Process OMPI jobid: [31184,1] App: 0 Process rank: 1
   Process OMPI jobid: [31184,1] App: 0 Process rank: 3
   Process OMPI jobid: [31184,1] App: 0 Process rank: 5

Data for node: node06  Num slots: 1    Max slots: 0    Num procs: 1
   Process OMPI jobid: [31184,1] App: 0 Process rank: 7


=============================================================
[node06.cluster:18857] MCW rank 7 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:21399] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
[node05.cluster:21399] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:21399] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
[node05.cluster:21399] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node05.cluster:21399] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:21399] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
[node05.cluster:21399] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
....


Then I add "-hostfile pbs_hosts" and the result is:

[mishima@manage work]$cat pbs_hosts
node05 slots=8
node06 slots=8
[mishima@manage work]$ mpirun -np 8 -hostfile ~/work/pbs_hosts -report-bindings -map-by socket:pe=2 -display-map ~/mis/openmpi/demos/myprog
Data for JOB [30254,1] offset 0

========================   JOB MAP   ========================

Data for node: node05  Num slots: 8    Max slots: 0    Num procs: 4
   Process OMPI jobid: [30254,1] App: 0 Process rank: 0
   Process OMPI jobid: [30254,1] App: 0 Process rank: 2
   Process OMPI jobid: [30254,1] App: 0 Process rank: 1
   Process OMPI jobid: [30254,1] App: 0 Process rank: 3

Data for node: node06  Num slots: 8    Max slots: 0    Num procs: 4
   Process OMPI jobid: [30254,1] App: 0 Process rank: 4
   Process OMPI jobid: [30254,1] App: 0 Process rank: 6
   Process OMPI jobid: [30254,1] App: 0 Process rank: 5
   Process OMPI jobid: [30254,1] App: 0 Process rank: 7


=============================================================
[node05.cluster:21501] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node05.cluster:21501] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
[node05.cluster:21501] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:21501] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
[node06.cluster:18935] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node06.cluster:18935] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
[node06.cluster:18935] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node06.cluster:18935] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
....


I think the previous version's behavior would be closer to what I expect.

Tetsuya

They each have 4 cores/socket and 2 sockets, 4 x 2 = 8 cores in total.

Here is the output of lstopo.

[mishima@manage round_robin]$ rsh node05
Last login: Tue Feb 18 15:10:15 from manage
[mishima@node05 ~]$ lstopo
Machine (32GB)
  NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (6144KB)
    L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
  NUMANode L#1 (P#1 16GB) + Socket L#1 + L3 L#1 (6144KB)
    L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
....

I focused on byobj_span and bynode. I didn't notice byobj was modified, sorry.

Tetsuya

Hmmm... what does your node look like again (sockets and cores)?

On Feb 27, 2014, at 3:19 PM, tmishima@jcity.maeda.co.jp wrote:


Hi Ralph, I'm afraid to say your new "map-by obj" causes another problem.

I get an overload message with this command line, as shown below:

mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map ~/mis/openmpi/demos/myprog

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:         CORE
   Node:            node05
   #processes:  2
   #cpus:          1

You can override this protection by adding the "overload-allowed"
option to your binding directive.

--------------------------------------------------------------------------

Then, I added "-bind-to core:overload-allowed" to see what happens.

mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog
Data for JOB [14398,1] offset 0

========================   JOB MAP   ========================

Data for node: node05  Num slots: 1    Max slots: 0    Num procs: 4
  Process OMPI jobid: [14398,1] App: 0 Process rank: 0
  Process OMPI jobid: [14398,1] App: 0 Process rank: 1
  Process OMPI jobid: [14398,1] App: 0 Process rank: 2
  Process OMPI jobid: [14398,1] App: 0 Process rank: 3

Data for node: node06  Num slots: 1    Max slots: 0    Num procs: 4
  Process OMPI jobid: [14398,1] App: 0 Process rank: 4
  Process OMPI jobid: [14398,1] App: 0 Process rank: 5
  Process OMPI jobid: [14398,1] App: 0 Process rank: 6
  Process OMPI jobid: [14398,1] App: 0 Process rank: 7


=============================================================
[node06.cluster:18443] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:20901] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node06.cluster:18443] MCW rank 7 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node05.cluster:20901] MCW rank 3 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node06.cluster:18443] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:20901] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node06.cluster:18443] MCW rank 5 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node05.cluster:20901] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
Hello world from process 4 of 8
Hello world from process 2 of 8
Hello world from process 6 of 8
Hello world from process 0 of 8
Hello world from process 5 of 8
Hello world from process 1 of 8
Hello world from process 7 of 8
Hello world from process 3 of 8

When I add "map-by obj:span", it works fine:

mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2,span -display-map ~/mis/openmpi/demos/myprog
Data for JOB [14703,1] offset 0

========================   JOB MAP   ========================

Data for node: node05  Num slots: 1    Max slots: 0    Num procs: 4
  Process OMPI jobid: [14703,1] App: 0 Process rank: 0
  Process OMPI jobid: [14703,1] App: 0 Process rank: 2
  Process OMPI jobid: [14703,1] App: 0 Process rank: 1
  Process OMPI jobid: [14703,1] App: 0 Process rank: 3
Data for node: node06  Num slots: 1    Max slots: 0    Num procs: 4
  Process OMPI jobid: [14703,1] App: 0 Process rank: 4
  Process OMPI jobid: [14703,1] App: 0 Process rank: 6
  Process OMPI jobid: [14703,1] App: 0 Process rank: 5
  Process OMPI jobid: [14703,1] App: 0 Process rank: 7


=============================================================
[node06.cluster:18491] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node05.cluster:20949] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
[node06.cluster:18491] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
[node05.cluster:20949] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
[node06.cluster:18491] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node05.cluster:20949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
[node06.cluster:18491] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
[node05.cluster:20949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
....

So, byobj_span would be okay. Of course, bynode and byslot should be okay.
Could you take a look at orte_rmaps_rr_byobj again?

Regards,
Tetsuya Mishima

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users