Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: tmishima_at_[hidden]
Date: 2013-12-19 03:54:26


Hi Ralph, sorry for the crossed posts.

Your advice about -hetero-nodes in the other thread gave me a hint.

I had already put "orte_hetero_nodes = 1" in my mca-params.conf, because
you told me a month ago that my environment would need this option.

After removing this line from mca-params.conf, it works.
In other words, you can replicate the problem by adding -hetero-nodes, as
shown below.

qsub: job 8364.manage.cluster completed
[mishima_at_manage mpi]$ qsub -I -l nodes=2:ppn=8
qsub: waiting for job 8365.manage.cluster to start
qsub: job 8365.manage.cluster ready

[mishima_at_node11 ~]$ ompi_info --all | grep orte_hetero_nodes
                MCA orte: parameter "orte_hetero_nodes" (current value: "false", data source: default, level: 9 dev/all, type: bool)
[mishima_at_node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
[node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
[mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -hetero-nodes myprog
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node: node12
   #processes: 2
   #cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

As far as I checked, data->num_bound seems to get a wrong value in
bind_downwards when I specify "-hetero-nodes". I hope you can track down the problem.
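
For anyone checking whether a parameter file still carries the setting discussed at the top of this message, here is a minimal sketch. It assumes the file uses plain "name = value" lines with "#" comments; read_mca_params and the sample text are hypothetical illustrations, not part of Open MPI.

```python
def read_mca_params(text):
    """Parse 'name = value' lines; anything after '#' is treated as a comment."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


# Hypothetical file contents, mirroring the line quoted in the message.
sample = """
# site-wide defaults
orte_hetero_nodes = 1
"""

params = read_mca_params(sample)
print(params.get("orte_hetero_nodes", "not set"))
```

If the key is reported as set, the runs behave as if -hetero-nodes had been given on the command line, which is the condition under which the error above was reproduced.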

Regards,
Tetsuya Mishima

> Yes, it's very strange. But I don't think there's any chance that
> I have < 8 actual cores on the node. I guess that you can replicate
> it with SLURM; please try it again.
>
> I switched to node10 and node11, and then I got the warning for
> node11.
>
> Furthermore, just for your information, I tried adding
> "-bind-to core:overload-allowed", and then it worked as shown below.
> But I think node11 is never overloaded because it has 8 cores.
>
> qsub: job 8342.manage.cluster completed
> [mishima_at_manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
> qsub: waiting for job 8343.manage.cluster to start
> qsub: job 8343.manage.cluster ready
>
> [mishima_at_node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima_at_node10 demos]$ cat $PBS_NODEFILE
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> [mishima_at_node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
> Bind to: CORE
> Node: node11
> #processes: 2
> #cpus: 1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
> [mishima_at_node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to core:overload-allowed myprog
> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 1 of 4
> Hello world from process 0 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
>
> Regards,
> Tetsuya Mishima
>
>
> > Very strange - I can't seem to replicate it. Is there any chance that you
> > have < 8 actual cores on node12?
> >
> >
> > On Dec 18, 2013, at 4:53 PM, tmishima_at_[hidden] wrote:
> >
> > >
> > >
> > > Hi Ralph, sorry for confusing you.
> > >
> > > At that time, I cut and pasted part of the output of "cat $PBS_NODEFILE".
> > > I guess I failed to paste the last line by mistake.
> > >
> > > I retried the test, and the following is exactly what I got.
> > >
> > > [mishima_at_manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
> > > qsub: waiting for job 8338.manage.cluster to start
> > > qsub: job 8338.manage.cluster ready
> > >
> > > [mishima_at_node11 ~]$ cat $PBS_NODEFILE
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node11
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > node12
> > > [mishima_at_node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> > > --------------------------------------------------------------------------
> > > A request was made to bind to that would result in binding more
> > > processes than cpus on a resource:
> > >
> > > Bind to: CORE
> > > Node: node12
> > > #processes: 2
> > > #cpus: 1
> > >
> > > You can override this protection by adding the "overload-allowed"
> > > option to your binding directive.
> > > --------------------------------------------------------------------------
> > >
> > > Regards,
> > >
> > > Tetsuya Mishima
> > >
> > >> I removed the debug in #2 - thanks for reporting it
> > >>
> > >> For #1, it actually looks to me like this is correct. If you look at your
> > >> allocation, there are only 7 slots being allocated on node12, yet you have
> > >> asked for 8 cpus to be assigned (2 procs with 4 cpus/proc).
> > >> So the warning is in fact correct
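
The slot arithmetic in this reply can be sketched as follows. This is only an illustration of the check being described (slots listed per host versus cpus requested), assuming a nodefile with one hostname per slot; slots_per_host and cpus_short are hypothetical helper names, and this is not how Open MPI itself performs the test.

```python
from collections import Counter


def slots_per_host(nodefile_text):
    """Count how many times each hostname appears (one line per slot)."""
    return Counter(line.strip() for line in nodefile_text.splitlines() if line.strip())


def cpus_short(slots, procs_on_host, cpus_per_proc):
    """Return how many cpus are missing on a host (0 if there are enough)."""
    return max(0, procs_on_host * cpus_per_proc - slots)


# The nodefile as first pasted: 8 lines for node11, only 7 for node12.
nodefile = "node11\n" * 8 + "node12\n" * 7
slots = slots_per_host(nodefile)

# 2 processes land on node12, each asking for 4 cpus (-cpus-per-proc 4).
print(slots["node12"], cpus_short(slots["node12"], 2, 4))
```

With 8 lines for node12 the same check reports no shortfall, which is why the later retest with a complete nodefile pointed at a different cause.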
> > >>
> > >>
> > >> On Dec 18, 2013, at 4:04 PM, tmishima_at_[hidden] wrote:
> > >>
> > >>>
> > >>>
> > >>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd
> > >>> like to report 3 issues, mainly regarding -cpus-per-proc.
> > >>>
> > >>> 1) When I use 2 nodes (node11, node12), which have 8 cores each
> > >>> (= 2 sockets X 4 cores/socket), it starts to produce the error again,
> > >>> as shown below. At least openmpi-1.7.4a1r29646 did work well.
> > >>>
> > >>> [mishima_at_manage ~]$ qsub -I -l nodes=2:ppn=8
> > >>> qsub: waiting for job 8336.manage.cluster to start
> > >>> qsub: job 8336.manage.cluster ready
> > >>>
> > >>> [mishima_at_node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>> [mishima_at_node11 demos]$ cat $PBS_NODEFILE
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node11
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> node12
> > >>> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> > >>> --------------------------------------------------------------------------
> > >>> A request was made to bind to that would result in binding more
> > >>> processes than cpus on a resource:
> > >>>
> > >>> Bind to: CORE
> > >>> Node: node12
> > >>> #processes: 2
> > >>> #cpus: 1
> > >>>
> > >>> You can override this protection by adding the "overload-allowed"
> > >>> option to your binding directive.
> > >>> --------------------------------------------------------------------------
> > >>>
> > >>> Of course it works well using only one node.
> > >>>
> > >>> [mishima_at_node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
> > >>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> Hello world from process 1 of 2
> > >>> Hello world from process 0 of 2
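
The bracketed map at the end of each -report-bindings line can be read mechanically: one bracket group per socket, cores separated by "/", with "B" marking a bound core and "." an unbound one. The sketch below decodes such a map under that reading; bound_cores is a hypothetical helper written for this note, and it numbers cores consecutively across sockets as the transcripts do.

```python
def bound_cores(mask):
    """Return (socket, core) pairs for every core marked 'B' in a binding map."""
    result = []
    core = 0
    # "[./././.][B/B/B/B]" -> ["./././.", "B/B/B/B"], one entry per socket
    for socket, group in enumerate(mask.strip("[]").split("][")):
        for cell in group.split("/"):
            if "B" in cell:
                result.append((socket, core))
            core += 1
    return result


print(bound_cores("[./././.][B/B/B/B]"))
# prints [(1, 4), (1, 5), (1, 6), (1, 7)]
```

The printed pairs agree with the "socket 1[core 4[hwt 0]] ... socket 1[core 7[hwt 0]]" text that precedes the same map in the rank 1 line.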
> > >>>
> > >>>
> > >>> 2) With "-bind-to numa" added, it works, but the message
> > >>> "bind:upward target NUMANode type NUMANode" appears.
> > >>> As far as I remember, I didn't see this kind of message before.
> > >>>
> > >>> mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4
-report-bindings
> > >>> -bind-to numa myprog
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode
type
> > >>> NUMANode
> > >>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> > > socket
> > >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]],
> > > socket
> > >>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> > >>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]],
> > > socket
> > >>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> > >>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > >>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]],
> > > socket
> > >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > >>> Hello world from process 1 of 4
> > >>> Hello world from process 0 of 4
> > >>> Hello world from process 3 of 4
> > >>> Hello world from process 2 of 4
> > >>>
> > >>>
> > >>> 3) I use the PGI compiler. It cannot accept the compiler switch
> > >>> "-Wno-variadic-macros", which is included in the configure script.
> > >>>
> > >>> btl_usnic_CFLAGS="-Wno-variadic-macros"
> > >>>
> > >>> I removed this switch, then I could continue to build 1.7.4rc1.
> > >>>
> > >>> Regards,
> > >>> Tetsuya Mishima
> > >>>
> > >>>
> > >>>> Hmmm...okay, I understand the scenario. Must be something in the algo
> > >>>> when it only has one node, so it shouldn't be too hard to track down.
> > >>>>
> > >>>> I'm off on travel for a few days, but will return to this when I get
> > >>>> back.
> > >>>>
> > >>>> Sorry for delay - will try to look at this while I'm gone, but can't
> > >>>> promise anything :-(
> > >>>>
> > >>>>
> > >>>> On Dec 10, 2013, at 6:58 PM, tmishima_at_[hidden] wrote:
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> Hi Ralph, sorry for the confusion.
> > >>>>>
> > >>>>> We usually log on to "manage", which is our control node.
> > >>>>> From manage, we submit jobs or enter a remote node such as
> > >>>>> node03 via Torque interactive mode (qsub -I).
> > >>>>>
> > >>>>> At that time, instead of using Torque, I just did rsh to node03 from
> > >>>>> manage and ran myprog on the node. I hope this makes clear what I did.
> > >>>>>
> > >>>>> Now I retried with "-host node03", which still causes the problem.
> > >>>>> (I confirmed that a local run on manage caused the same problem too.)
> > >>>>>
> > >>>>> [mishima_at_manage ~]$ rsh node03
> > >>>>> Last login: Wed Dec 11 11:38:57 from manage
> > >>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>> [mishima_at_node03 demos]$
> > >>>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03
-report-bindings
> > >>>>> -cpus-per-proc 4 -map-by socket myprog
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>> A request was made to bind to that would result in binding more
> > >>>>> processes than cpus on a resource:
> > >>>>>
> > >>>>> Bind to: CORE
> > >>>>> Node: node03
> > >>>>> #processes: 2
> > >>>>> #cpus: 1
> > >>>>>
> > >>>>> You can override this protection by adding the "overload-allowed"
> > >>>>> option to your binding directive.
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>
> > >>>>> It's strange, but I have to report that "-map-by socket:span" worked
> > >>>>> well.
> > >>>>>
> > >>>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03
-report-bindings
> > >>>>> -cpus-per-proc 4 -map-by socket:span myprog
> > >>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt
0]],
> > >>> socket
> > >>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt
> 0]],
> > >>> socket
> > >>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>> socket 1[core 15[hwt 0]]:
> > >>>>>
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt
> 0]],
> > >>> socket
> > >>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>> socket 2[core 19[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt
> 0]],
> > >>> socket
> > >>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>> socket 2[core 23[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt
> 0]],
> > >>> socket
> > >>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>> socket 3[core 27[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt
> 0]],
> > >>> socket
> > >>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>> socket 3[core 31[hwt 0]]:
> > >>>>>
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt
0]],
> > >>> socket
> > >>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>> cket 0[core 3[hwt 0]]:
> > >>>>>
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt
0]],
> > >>> socket
> > >>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>> cket 0[core 7[hwt 0]]:
> > >>>>>
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>> Hello world from process 2 of 8
> > >>>>> Hello world from process 6 of 8
> > >>>>> Hello world from process 3 of 8
> > >>>>> Hello world from process 7 of 8
> > >>>>> Hello world from process 1 of 8
> > >>>>> Hello world from process 5 of 8
> > >>>>> Hello world from process 0 of 8
> > >>>>> Hello world from process 4 of 8
> > >>>>>
> > >>>>> Regards,
> > >>>>> Tetsuya Mishima
> > >>>>>
> > >>>>>
> > >>>>>> On Dec 10, 2013, at 6:05 PM, tmishima_at_[hidden] wrote:
> > >>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Hi Ralph,
> > >>>>>>>
> > >>>>>>> I tried again with -cpus-per-proc 2 as shown below.
> > >>>>>>> Here, I found that "-map-by socket:span" worked well.
> > >>>>>>>
> > >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 2
> > >>>>>>> -map-by socket:span myprog
> > >>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
> > >>>>>>> /./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 17[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][B/B/./././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 19[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][././B/B/./././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 25[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][./././././././.][B/B/./././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 27[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /./././.][./././././././.][././B/B/./././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 2
> > >>>>>>> -map-by socket myprog
> > >>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 7[hwt 0]]: [././././././B/B][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
> > >>>>>>> /./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 13[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /B/B/./.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 15[hwt 0]]: [./././././././.][./././.
> > >>>>>>> /././B/B][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
> > >>>>>>> /././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>>
> > >>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets.
> > >>>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket"
> > >>>>>>> have the same meaning.
> > >>>>>>> Therefore, there's no problem with that. Sorry for disturbing you.
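
The placements reported in this message can be reproduced with a small toy model. It assumes a single node with 4 sockets of 8 cores, as in the node03 transcripts, and treats "socket" as fill-until-full and "socket:span" as an even split in rank order; place is a hypothetical function that only mirrors the observed output and is not Open MPI's mapper.

```python
def place(np, sockets, cores_per_socket, cpus_per_proc, span):
    """Return the socket index for each rank 0..np-1 under the toy model."""
    if np * cpus_per_proc > sockets * cores_per_socket:
        raise ValueError("more cpus requested than the node has cores")
    if span:
        # spread the ranks evenly over all sockets (ceiling division)
        per_socket = -(-np // sockets)
    else:
        # fill one socket completely before moving to the next
        per_socket = cores_per_socket // cpus_per_proc
    return [rank // per_socket for rank in range(np)]


for cpus in (2, 4):
    print(cpus, "fill:", place(8, 4, 8, cpus, span=False))
    print(cpus, "span:", place(8, 4, 8, cpus, span=True))
```

With -cpus-per-proc 2 the two policies differ (fill uses only sockets 0 and 1), while with -cpus-per-proc 4 the 8 ranks need all 32 cores, so both policies give the same placement, which is the point made above.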
> > >>>>>>
> > >>>>>> No problem - glad you could clear that up :-)
> > >>>>>>
> > >>>>>>>
> > >>>>>>> By the way, through this test, I found another problem.
> > >>>>>>> Without the Torque manager, just using rsh, it causes the same
> > >>>>>>> error, as shown below:
> > >>>>>>>
> > >>>>>>> [mishima_at_manage openmpi-1.7]$ rsh node03
> > >>>>>>> Last login: Wed Dec 11 09:42:02 from manage
> > >>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 4
> > >>>>>>> -map-by socket myprog
> > >>>>>>
> > >>>>>> I don't understand the difference here - you are simply starting it
> > >>>>>> from a different node? It looks like everything is expected to run local to
> > >>>>>> mpirun, yes? So there is no rsh actually involved here.
> > >>>>>> Are you still running in an allocation?
> > >>>>>>
> > >>>>>> If you run this with "-host node03" on the cmd line, do you see the same
> > >>>>>> problem?
> > >>>>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>>> A request was made to bind to that would result in binding more
> > >>>>>>> processes than cpus on a resource:
> > >>>>>>>
> > >>>>>>> Bind to: CORE
> > >>>>>>> Node: node03
> > >>>>>>> #processes: 2
> > >>>>>>> #cpus: 1
> > >>>>>>>
> > >>>>>>> You can override this protection by adding the
"overload-allowed"
> > >>>>>>> option to your binding directive.
> > >>>>>>>
> > >>>>>
> > >>>
> > >
>
--------------------------------------------------------------------------
> > >>>>>>> [mishima_at_node03 demos]$
> > >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings
> > > -cpus-per-proc
> > >>> 4
> > >>>>>>> myprog
> > >>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>>>> socket 1[core 15[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>>>> socket 2[core 19[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>>>> socket 2[core 23[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>>>> socket 3[core 27[hwt 0]]:
> > > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt
> > > 0]],
> > >>>>> socket
> > >>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>>>> socket 3[core 31[hwt 0]]:
> > >>>>>>>
> > > [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>>>> cket 0[core 3[hwt 0]]:
> > >>>>>>>
> > > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt
> 0]],
> > >>>>> socket
> > >>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>>>> cket 0[core 7[hwt 0]]:
> > >>>>>>>
> > > [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>> Hello world from process 4 of 8
> > >>>>>>> Hello world from process 2 of 8
> > >>>>>>> Hello world from process 6 of 8
> > >>>>>>> Hello world from process 5 of 8
> > >>>>>>> Hello world from process 3 of 8
> > >>>>>>> Hello world from process 7 of 8
> > >>>>>>> Hello world from process 0 of 8
> > >>>>>>> Hello world from process 1 of 8
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Tetsuya Mishima
> > >>>>>>>
> > >>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but let me
> > >>>>>>>> poke around a bit and see what might be happening.
> > >>>>>>>>
> > >>>>>>>> On Dec 10, 2013, at 4:47 PM, tmishima_at_[hidden] wrote:
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Hi Ralph,
> > >>>>>>>>>
> > >>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
> > >>>>>>>>>
> > >>>>>>>>> But it still causes the problem; it seems socket:span doesn't
> > >>>>>>>>> work.
> > >>>>>>>>>
> > >>>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=node03:ppn=32
> > >>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
> > >>>>>>>>> qsub: job 8265.manage.cluster ready
> > >>>>>>>>>
> > >>>>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings
> > >>> -cpus-per-proc
> > >>>>> 4
> > >>>>>>>>> -map-by socket:span myprog
> > >>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8
[hwt
> > > 0]],
> > >>>>>>> socket
> > >>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
> > >>>>>>>>> ocket 1[core 11[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
> > >>>>>>>>> socket 1[core 15[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
> > >>>>>>>>> socket 2[core 19[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
> > >>>>>>>>> socket 2[core 23[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
> > >>>>>>>>> socket 3[core 27[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28
[hwt
> > >>> 0]],
> > >>>>>>> socket
> > >>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
> > >>>>>>>>> socket 3[core 31[hwt 0]]:
> > >>>>>>>>>
> > >>>
[./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0
[hwt
> > > 0]],
> > >>>>>>> socket
> > >>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > >>>>>>>>> cket 0[core 3[hwt 0]]:
> > >>>>>>>>>
> > >>>
[B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4
[hwt
> > > 0]],
> > >>>>>>> socket
> > >>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> > >>>>>>>>> cket 0[core 7[hwt 0]]:
> > >>>>>>>>>
> > >>>
[././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>> Hello world from process 0 of 8
> > >>>>>>>>> Hello world from process 3 of 8
> > >>>>>>>>> Hello world from process 1 of 8
> > >>>>>>>>> Hello world from process 4 of 8
> > >>>>>>>>> Hello world from process 6 of 8
> > >>>>>>>>> Hello world from process 5 of 8
> > >>>>>>>>> Hello world from process 2 of 8
> > >>>>>>>>> Hello world from process 7 of 8
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>
> > >>>>>>>>>> No, that is actually correct. We map a socket until full, then move to
> > >>>>>>>>>> the next. What you want is --map-by socket:span
> > >>>>>>>>>>
> > >>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmishima_at_[hidden]
wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
> > >>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket" itself
> > >>>>>>>>>>> didn't work well, as shown below:
> > >>>>>>>>>>>
> > >>>>>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=1:ppn=32
> > >>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
> > >>>>>>>>>>> qsub: job 8260.manage.cluster ready
> > >>>>>>>>>>>
> > >>>>>>>>>>> [mishima_at_node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > >>>>>>>>>>> [mishima_at_node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> Hello world from process 2 of 8
> > >>>>>>>>>>> Hello world from process 1 of 8
> > >>>>>>>>>>> Hello world from process 3 of 8
> > >>>>>>>>>>> Hello world from process 0 of 8
> > >>>>>>>>>>> Hello world from process 6 of 8
> > >>>>>>>>>>> Hello world from process 5 of 8
> > >>>>>>>>>>> Hello world from process 4 of 8
> > >>>>>>>>>>> Hello world from process 7 of 8
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think this should be like this:
> > >>>>>>>>>>>
> > >>>>>>>>>>> rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>> rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>> rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
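The expected layout above can be sketched as a small script. This is only an illustration of the round-robin-by-socket pattern the poster describes, not Open MPI's mapper. The function name expected_bindings is hypothetical, and the placement of ranks 4 and up (wrapping back to socket 0 and taking the next four cores) is an assumption, since the original list stops at rank 02.

```shell
#!/bin/sh
# Hypothetical helper: print the binding masks one would expect if ranks
# were placed round-robin across sockets, each rank taking a contiguous
# group of cores inside its socket.
#   $1 = number of ranks      $2 = number of sockets
#   $3 = cores per socket     $4 = cpus per process
expected_bindings() {
    awk -v np="$1" -v ns="$2" -v cps="$3" -v cpp="$4" 'BEGIN {
        for (r = 0; r < np; r++) {
            s = r % ns                  # socket chosen round-robin
            first = int(r / ns) * cpp   # assumption: later passes use the next cores
            line = sprintf("rank %02d ", r)
            for (k = 0; k < ns; k++) {
                mask = ""
                for (c = 0; c < cps; c++) {
                    hit = (k == s && c >= first && c < first + cpp)
                    mask = mask (c ? "/" : "") (hit ? "B" : ".")
                }
                line = line "[" mask "]"
            }
            print line
        }
    }'
}

# Same shape as the job above: 8 ranks, 4 sockets, 8 cores each, 4 cpus per rank.
expected_bindings 8 4 8 4
```

The first three lines of output can be compared against the rank 00 to rank 02 masks listed above.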
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and have
> > >>>>>>>>>>>> scheduled it for 1.7.4.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>> Ralph
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmishima_at_[hidden] wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thank you very much for your quick response.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'm afraid to say that I found one more issue...
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It's not so serious. Please check it when you have plenty of time.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The problem is -cpus-per-proc combined with the -map-by option under
> > >>>>>>>>>>>>> the Torque manager. It doesn't work, as shown below. I guess you can
> > >>>>>>>>>>>>> get the same behaviour under the Slurm manager.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima_at_manage testbed2]$ qsub -I -l nodes=1:ppn=32
> > >>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
> > >>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima_at_node03 ~]$ cd ~/Ducom/testbed2
> > >>>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> > >>>>>>>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>>>>>>> A request was made to bind to that would result in binding more
> > >>>>>>>>>>>>> processes than cpus on a resource:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Bind to: CORE
> > >>>>>>>>>>>>> Node: node03
> > >>>>>>>>>>>>> #processes: 2
> > >>>>>>>>>>>>> #cpus: 1
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
> > >>>>>>>>>>>>> option to your binding directive.
> > >>>>>>>>>>>>> --------------------------------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I'll update tomorrow.
> > >>>>>>>>>>>>>> Ralph
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmishima_at_[hidden]> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Ralph,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This is a continuation of "Segmentation fault in oob_tcp.c of
> > >>>>>>>>>>>>>> openmpi-1.7.4a1r29646".
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I found the cause.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> First, I noticed that your hostfile works and mine does not.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Your host file:
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> bend001 slots=12
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My host file:
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> node08
> > >>>>>>>>>>>>>> node08
> > >>>>>>>>>>>>>> ...(total 8 lines)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my hostfile
> > >>>>>>>>>>>>>> just before launching mpirun. Then it worked.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My host file(modified):
> > >>>>>>>>>>>>>> cat hosts
> > >>>>>>>>>>>>>> node08 slots=1
> > >>>>>>>>>>>>>> node08 slots=1
> > >>>>>>>>>>>>>> ...(total 8 lines)
> > >>>>>>>>>>>>>>
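A minimal sketch of that workaround, assuming the hostfile has one host name per line as in the listing above. The function name add_slots is hypothetical; the job script used in this thread is not shown, so this is only one way such a step could look.

```shell
#!/bin/sh
# Hypothetical helper: append "slots=1" to every hostfile line that does
# not already carry a slots= setting. Reads the named file, or standard
# input when no file is given, and writes the result to standard output.
add_slots() {
    awk '
        /^[ \t]*$/         { print; next }   # keep blank lines unchanged
        /(^|[ \t])slots=/  { print; next }   # line already has a slot count
                           { print $0 " slots=1" }
    ' "$@"
}

# Example input: repeated host names, one of them already annotated.
printf 'node08\nnode08\nnode08 slots=2\n' | add_slots
```

In a job script the rewritten file would be saved and then passed to mpirun with -hostfile in place of the original one.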
> > >>>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference between
> > >>>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> > >>>>>>>>>>>>>> 394,401c394,399
> > >>>>>>>>>>>>>> < if (got_count) {
> > >>>>>>>>>>>>>> <     node->slots_given = true;
> > >>>>>>>>>>>>>> < } else if (got_max) {
> > >>>>>>>>>>>>>> <     node->slots = node->slots_max;
> > >>>>>>>>>>>>>> <     node->slots_given = true;
> > >>>>>>>>>>>>>> < } else {
> > >>>>>>>>>>>>>> <     /* should be set by obj_new, but just to be clear */
> > >>>>>>>>>>>>>> <     node->slots_given = false;
> > >>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>> > if (!got_count) {
> > >>>>>>>>>>>>>> >     if (got_max) {
> > >>>>>>>>>>>>>> >         node->slots = node->slots_max;
> > >>>>>>>>>>>>>> >     } else {
> > >>>>>>>>>>>>>> >         ++node->slots;
> > >>>>>>>>>>>>>> >     }
> > >>>>>>>>>>>>>> ....
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Finally, I added line 402 below as a tentative trial.
> > >>>>>>>>>>>>>> Then it worked.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>> 394 if (got_count) {
> > >>>>>>>>>>>>>> 395     node->slots_given = true;
> > >>>>>>>>>>>>>> 396 } else if (got_max) {
> > >>>>>>>>>>>>>> 397     node->slots = node->slots_max;
> > >>>>>>>>>>>>>> 398     node->slots_given = true;
> > >>>>>>>>>>>>>> 399 } else {
> > >>>>>>>>>>>>>> 400     /* should be set by obj_new, but just to be clear */
> > >>>>>>>>>>>>>> 401     node->slots_given = false;
> > >>>>>>>>>>>>>> 402     ++node->slots; /* added by tmishima */
> > >>>>>>>>>>>>>> 403 }
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>>
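To make the effect of the added line concrete, here is a rough model of the counting rule in awk rather than the actual hostfile.c code: each hostfile line without a slot count adds one slot to its node, which is what ++node->slots does in the listing above. The function name count_slots is hypothetical, and treating "slots=N" as adding N to the node's total is an assumption, since that part of the parser is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical model of per-node slot counting for a hostfile.
# Prints "host total_slots" for each host, in order of first appearance.
count_slots() {
    awk '
        NF == 0 { next }                        # skip blank lines
        {
            host = $1; given = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) {    # explicit count (assumed additive)
                    split($i, kv, "=")
                    slots[host] += kv[2]
                    given = 1
                }
            if (!given) slots[host]++           # bare line counts as one slot
            if (!(host in seen)) { seen[host] = 1; order[++n] = host }
        }
        END { for (i = 1; i <= n; i++) print order[i], slots[order[i]] }
    ' "$@"
}

# A Torque-style hostfile with one bare line per allocated core.
printf 'node08\nnode08\nnode08\n' | count_slots
```

Without the increment in the bare-line branch, every node listed only by name would end up with no slots counted for it, which matches the symptom described above.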
> > >>>>>>>>>>>>>> Please fix the problem properly, because this is just based on my
> > >>>>>>>>>>>>>> guess. It's related to the handling of hostfile entries where
> > >>>>>>>>>>>>>> slots information is not given.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>> Tetsuya Mishima
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> _______________________________________________
> > >>>>>>>>>>>>>> users mailing list
> > >>>>>>>>>>>>>> users_at_[hidden]
> > >>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
