Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-12-19 19:44:54


Actually, it looks like it would happen whenever hetero-nodes is set - it only requires that at least two nodes have the same architecture. So you might want to give the trunk a shot, as it may well be fixed now.

On Dec 19, 2013, at 8:35 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> Hmmm...not having any luck tracking this down yet. If anything, based on what I saw in the code, I would have expected it to fail when hetero-nodes was false, not the other way around.
>
> I'll keep poking around - just wanted to provide an update.
>
> On Dec 19, 2013, at 12:54 AM, tmishima_at_[hidden] wrote:
>
>>
>>
>> Hi Ralph, sorry for cutting into the thread.
>>
>> Your advice about -hetero-nodes in the other thread gave me a hint.
>>
>> I already put "orte_hetero_nodes = 1" in my mca-params.conf, because
>> you told me a month ago that my environment would need this option.
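>>
>> (For reference, that's this single line in the params file - the path
>> assuming the default per-user location:)
>>
>> $ grep hetero ~/.openmpi/mca-params.conf
>> orte_hetero_nodes = 1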
>>
>> Removing this line from mca-params.conf makes it work.
>> In other words, you can replicate the failure by adding -hetero-nodes as
>> shown below.
>>
>> qsub: job 8364.manage.cluster completed
>> [mishima_at_manage mpi]$ qsub -I -l nodes=2:ppn=8
>> qsub: waiting for job 8365.manage.cluster to start
>> qsub: job 8365.manage.cluster ready
>>
>> [mishima_at_node11 ~]$ ompi_info --all | grep orte_hetero_nodes
>> MCA orte: parameter "orte_hetero_nodes" (current value:
>> "false", data source: default, level: 9 dev/all,
>> type: bool)
>> [mishima_at_node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> myprog
>> [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> Hello world from process 0 of 4
>> Hello world from process 1 of 4
>> Hello world from process 2 of 4
>> Hello world from process 3 of 4
>> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> -hetero-nodes myprog
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>>
>> Bind to: CORE
>> Node: node12
>> #processes: 2
>> #cpus: 1
>>
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> --------------------------------------------------------------------------
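>>
>> (For what it's worth, "-hetero-nodes" here should be equivalent to
>> setting the parameter per-run on the command line - as far as I
>> understand:)
>>
>> mpirun -mca orte_hetero_nodes 1 -np 4 -cpus-per-proc 4 -report-bindings myprog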
>>
>>
>> As far as I checked, data->num_bound seems to go bad in bind_downwards
>> when I add "-hetero-nodes". I hope you can clear up the problem.
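>>
>> (For context on why "#cpus: 1" shows up at all: the overload guard is
>> essentially a per-object bound-process counter compared against that
>> object's cpu count. A self-contained toy version of that logic - my
>> sketch for illustration, not the actual OMPI code:)
>>
>> #include <stdio.h>
>> #include <stdbool.h>
>>
>> /* Toy model of the binding overload check: each hardware object counts
>>    how many procs are bound to it; binding more procs than cpus is
>>    refused unless overload is explicitly allowed. */
>> typedef struct { int ncpus; int num_bound; } hw_obj_t;
>>
>> static bool bind_proc(hw_obj_t *obj, bool overload_allowed) {
>>     obj->num_bound++;
>>     if (obj->ncpus < obj->num_bound && !overload_allowed) {
>>         fprintf(stderr, "binding-overload: #processes: %d  #cpus: %d\n",
>>                 obj->num_bound, obj->ncpus);
>>         return false;
>>     }
>>     return true;
>> }
>>
>> int main(void) {
>>     hw_obj_t core = { .ncpus = 1, .num_bound = 0 };
>>     bind_proc(&core, false);        /* ok: 1 proc on 1 cpu */
>>     if (!bind_proc(&core, false))   /* refused: 2 procs on 1 cpu */
>>         fprintf(stderr, "add overload-allowed to override\n");
>>     return 0;
>> }
>>
>> (So if num_bound is mis-counted, the guard fires even though the node
>> really has 8 free cores - which matches what I see.)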
>>
>> Regards,
>> Tetsuya Mishima
>>
>>
>>> Yes, it's very strange. But I don't think there's any chance that
>>> I have < 8 actual cores on the node. I guess that you can replicate
>>> it with SLURM, so please try it again.
>>>
>>> I changed to node10 and node11, and then I got the warning against
>>> node11.
>>>
>>> Furthermore, just as information for you, I tried adding
>>> "-bind-to core:overload-allowed", and then it worked as shown below.
>>> But I think node11 is never overloaded because it has 8 cores.
>>>
>>> qsub: job 8342.manage.cluster completed
>>> [mishima_at_manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
>>> qsub: waiting for job 8343.manage.cluster to start
>>> qsub: job 8343.manage.cluster ready
>>>
>>> [mishima_at_node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima_at_node10 demos]$ cat $PBS_NODEFILE
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> [mishima_at_node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>>> myprog
>>> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>> Bind to: CORE
>>> Node: node11
>>> #processes: 2
>>> #cpus: 1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --------------------------------------------------------------------------
>>> [mishima_at_node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>>> -bind-to core:overload-allowed myprog
>>> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> Hello world from process 1 of 4
>>> Hello world from process 0 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 2 of 4
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>
>>>> Very strange - I can't seem to replicate it. Is there any chance that you
>>>> have < 8 actual cores on node12?
>>>>
>>>>
>>>> On Dec 18, 2013, at 4:53 PM, tmishima_at_[hidden] wrote:
>>>>
>>>>>
>>>>>
>>>>> Hi Ralph, sorry for confusing you.
>>>>>
>>>>> At that time, I cut and pasted part of the "cat $PBS_NODEFILE" output.
>>>>> I guess I dropped the last line by mistake.
>>>>>
>>>>> I retried the test, and below is exactly what I got when I did the
>>>>> test.
>>>>>
>>>>> [mishima_at_manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
>>>>> qsub: waiting for job 8338.manage.cluster to start
>>>>> qsub: job 8338.manage.cluster ready
>>>>>
>>>>> [mishima_at_node11 ~]$ cat $PBS_NODEFILE
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node11
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> node12
>>>>> [mishima_at_node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>> Bind to: CORE
>>>>> Node: node12
>>>>> #processes: 2
>>>>> #cpus: 1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tetsuya Mishima
>>>>>
>>>>>> I removed the debug in #2 - thanks for reporting it
>>>>>>
>>>>>> For #1, it actually looks to me like this is correct. If you look at your
>>>>>> allocation, there are only 7 slots being allocated on node12, yet you have
>>>>>> asked for 8 cpus to be assigned (2 procs with 4 cpus/proc). So the warning
>>>>>> is in fact correct
>>>>>>
>>>>>>
>>>>>> On Dec 18, 2013, at 4:04 PM, tmishima_at_[hidden] wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded. So I'd
>>>>>>> like to report 3 issues, mainly regarding -cpus-per-proc.
>>>>>>>
>>>>>>> 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2
>>>>>>> sockets x 4 cores/socket), it starts to produce the error again as
>>>>>>> shown below. At least openmpi-1.7.4a1r29646 worked well.
>>>>>>>
>>>>>>> [mishima_at_manage ~]$ qsub -I -l nodes=2:ppn=8
>>>>>>> qsub: waiting for job 8336.manage.cluster to start
>>>>>>> qsub: job 8336.manage.cluster ready
>>>>>>>
>>>>>>> [mishima_at_node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>> [mishima_at_node11 demos]$ cat $PBS_NODEFILE
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node11
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> node12
>>>>>>> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A request was made to bind to that would result in binding more
>>>>>>> processes than cpus on a resource:
>>>>>>>
>>>>>>> Bind to: CORE
>>>>>>> Node: node12
>>>>>>> #processes: 2
>>>>>>> #cpus: 1
>>>>>>>
>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>> option to your binding directive.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Of course it works well using only one node.
>>>>>>>
>>>>>>> [mishima_at_node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
>>>>>>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>> Hello world from process 1 of 2
>>>>>>> Hello world from process 0 of 2
>>>>>>>
>>>>>>>
>>>>>>> 2) Adding "-bind-to numa", it works, but the message "bind:upward
>>>>>>> target NUMANode type NUMANode" appears.
>>>>>>> As far as I remember, I didn't see this kind of message before.
>>>>>>>
>>>>>>> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to numa myprog
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>>>>>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>> Hello world from process 1 of 4
>>>>>>> Hello world from process 0 of 4
>>>>>>> Hello world from process 3 of 4
>>>>>>> Hello world from process 2 of 4
>>>>>>>
>>>>>>>
>>>>>>> 3) I use the PGI compiler. It cannot accept the compiler switch
>>>>>>> "-Wno-variadic-macros", which is included in the configure script:
>>>>>>>
>>>>>>> btl_usnic_CFLAGS="-Wno-variadic-macros"
>>>>>>>
>>>>>>> I removed this switch, and then I could continue to build 1.7.4rc1.
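>>>>>>>
>>>>>>> (For reference, the edit amounts to this one-liner - a sketch
>>>>>>> assuming GNU sed, applied before running configure:)
>>>>>>>
>>>>>>> sed -i 's/-Wno-variadic-macros//g' configure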
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>>
>>>>>>>> Hmmm...okay, I understand the scenario. Must be something in the algo
>>>>>>>> when it only has one node, so it shouldn't be too hard to track down.
>>>>>>>>
>>>>>>>> I'm off on travel for a few days, but will return to this when I get
>>>>>>>> back.
>>>>>>>>
>>>>>>>> Sorry for the delay - will try to look at this while I'm gone, but
>>>>>>>> can't promise anything :-(
>>>>>>>>
>>>>>>>>
>>>>>>>> On Dec 10, 2013, at 6:58 PM, tmishima_at_[hidden] wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Ralph, sorry for the confusion.
>>>>>>>>>
>>>>>>>>> We usually log on to "manage", which is our control node.
>>>>>>>>> From manage, we submit a job or enter a remote node such as
>>>>>>>>> node03 via Torque interactive mode (qsub -I).
>>>>>>>>>
>>>>>>>>> At that time, instead of Torque, I just did rsh to node03 from manage
>>>>>>>>> and ran myprog on the node. I hope you can understand what I did.
>>>>>>>>>
>>>>>>>>> Now, I retried with "-host node03", which still causes the problem:
>>>>>>>>> (I confirmed a local run on manage caused the same problem too)
>>>>>>>>>
>>>>>>>>> [mishima_at_manage ~]$ rsh node03
>>>>>>>>> Last login: Wed Dec 11 11:38:57 from manage
>>>>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>> [mishima_at_node03 demos]$
>>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>
>>>>>>>>> Bind to: CORE
>>>>>>>>> Node: node03
>>>>>>>>> #processes: 2
>>>>>>>>> #cpus: 1
>>>>>>>>>
>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>> option to your binding directive.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> It's strange, but I have to report that "-map-by socket:span" worked well.
>>>>>>>>>
>>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
>>>>>>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Tetsuya Mishima
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Dec 10, 2013, at 6:05 PM, tmishima_at_[hidden] wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>
>>>>>>>>>>> I tried again with -cpus-per-proc 2 as shown below.
>>>>>>>>>>> Here, I found that "-map-by socket:span" worked well.
>>>>>>>>>>>
>>>>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
>>>>>>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
>>>>>>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
>>>>>>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
>>>>>>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>
>>>>>>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets.
>>>>>>>>>>> In this case, I guess "-map-by socket:span" and "-map-by
>> socket"
>>>>> has
>>>>>>>>> same
>>>>>>>>>>> meaning.
>>>>>>>>>>> Therefore, there's no problem about that. Sorry for distubing.
>>>>>>>>>>
>>>>>>>>>> No problem - glad you could clear that up :-)
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> By the way, through this test, I found another problem.
>>>>>>>>>>> Without the Torque manager, just using rsh, it causes the same
>>>>>>>>>>> error as below:
>>>>>>>>>>>
>>>>>>>>>>> [mishima_at_manage openmpi-1.7]$ rsh node03
>>>>>>>>>>> Last login: Wed Dec 11 09:42:02 from manage
>>>>>>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>>>
>>>>>>>>>> I don't understand the difference here - you are simply starting it
>>>>>>>>>> from a different node? It looks like everything is expected to run
>>>>>>>>>> local to mpirun, yes? So there is no rsh actually involved here.
>>>>>>>>>> Are you still running in an allocation?
>>>>>>>>>>
>>>>>>>>>> If you run this with "-host node03" on the cmd line, do you see the
>>>>>>>>>> same problem?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>
>>>>>>>>>>> Bind to: CORE
>>>>>>>>>>> Node: node03
>>>>>>>>>>> #processes: 2
>>>>>>>>>>> #cpus: 1
>>>>>>>>>>>
>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>>> option to your binding directive.
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> [mishima_at_node03 demos]$
>>>>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
>>>>>>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>
>>>>>>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but let
>>>>>>>>>>>> me poke around a bit and see what might be happening.
>>>>>>>>>>>>
>>>>>>>>>>>> On Dec 10, 2013, at 4:47 PM, tmishima_at_[hidden] wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
>>>>>>>>>>>>>
>>>>>>>>>>>>> But it still causes the problem; it seems socket:span doesn't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=node03:ppn=32
>>>>>>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
>>>>>>>>>>>>> qsub: job 8265.manage.cluster ready
>>>>>>>>>>>>>
>>>>>>>>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>
>>>>>>>>>>>>>> No, that is actually correct. We map a socket until full, then move
>>>>>>>>>>>>>> to the next. What you want is --map-by socket:span
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmishima_at_[hidden] wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
>>>>>>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket" itself
>>>>>>>>>>>>>>> didn't work well, as shown below:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
>>>>>>>>>>>>>>> qsub: job 8260.manage.cluster ready
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [mishima_at_node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>>>>>> [mishima_at_node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this should be like this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>> rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>> rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and
>>>>>>>>>>>>>>>> have scheduled it for 1.7.4.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmishima_at_[hidden] wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you very much for your quick response.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm afraid to say that I found one more issue...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's not so serious. Please check it when you have time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under the
>>>>>>>>>>>>>>>>> Torque manager. It doesn't work as shown below. I guess you can
>>>>>>>>>>>>>>>>> get the same behaviour under the Slurm manager.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [mishima_at_manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>>>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [mishima_at_node03 ~]$ cd ~/Ducom/testbed2
>>>>>>>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Bind to: CORE
>>>>>>>>>>>>>>>>> Node: node03
>>>>>>>>>>>>>>>>> #processes: 2
>>>>>>>>>>>>>>>>> #cpus: 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>>>>>>>>> option to your binding directive.
>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'll update tomorrow.
>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmishima_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is the continuation of "Segmentation fault in oob_tcp.c
>>>>>>>>>>>>>>>>>> of openmpi-1.7.4a1r29646".
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I found the cause.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Firstly, I noticed that your hostfile works and mine does not.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your host file:
>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>> bend001 slots=12
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My host file:
>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my
>>>>>>>>>>>>>>>>>> hostfile just before launching mpirun. Then it worked.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My host file(modified):
>>>>>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>>>>>> ...(total 8 lines)
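>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (For reference, the script edit amounts to something like this
>>>>>>>>>>>>>>>>>> one-liner - a sketch assuming a plain one-node-per-line hostfile:)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> sed -e 's/$/ slots=1/' hosts > hosts.tmp && mv hosts.tmp hosts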
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference between
>>>>>>>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>>>>>>>>>>>>>> 394,401c394,399
>>>>>>>>>>>>>>>>>> < if (got_count) {
>>>>>>>>>>>>>>>>>> < node->slots_given = true;
>>>>>>>>>>>>>>>>>> < } else if (got_max) {
>>>>>>>>>>>>>>>>>> < node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>> < node->slots_given = true;
>>>>>>>>>>>>>>>>>> < } else {
>>>>>>>>>>>>>>>>>> <            /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>>> < node->slots_given = false;
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>> if (!got_count) {
>>>>>>>>>>>>>>>>>>> if (got_max) {
>>>>>>>>>>>>>>>>>>> node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>>>>>>> ++node->slots;
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Finally, I added line 402 below just as a tentative trial.
>>>>>>>>>>>>>>>>>> Then, it worked.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> 394        if (got_count) {
>>>>>>>>>>>>>>>>>> 395            node->slots_given = true;
>>>>>>>>>>>>>>>>>> 396        } else if (got_max) {
>>>>>>>>>>>>>>>>>> 397            node->slots = node->slots_max;
>>>>>>>>>>>>>>>>>> 398            node->slots_given = true;
>>>>>>>>>>>>>>>>>> 399        } else {
>>>>>>>>>>>>>>>>>> 400            /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>>>>>> 401            node->slots_given = false;
>>>>>>>>>>>>>>>>>> 402            ++node->slots; /* added by tmishima */
>>>>>>>>>>>>>>>>>> 403        }
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please fix the problem properly, because my change is just based
>>>>>>>>>>>>>>>>>> on a random guess. It's related to the treatment of a hostfile
>>>>>>>>>>>>>>>>>> where slots information is not given.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>>>>>>