Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-12-18 19:43:29


I removed the debug output in #2 - thanks for reporting it.

For #1, it actually looks to me like this is correct. If you look at your allocation, there are only 7 slots being allocated on node12, yet you have asked for 8 cpus to be assigned there (2 procs with 4 cpus/proc). So the warning is in fact correct.
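
If you really do want that layout despite the short allocation, the override mentioned in the warning would look something like this (a sketch of the 1.7-series qualifier syntax, untested here):

mpirun -np 4 -cpus-per-proc 4 -bind-to core:overload-allowed -report-bindings myprog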

On Dec 18, 2013, at 4:04 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Ralph, I found that openmpi-1.7.4rc1 has already been uploaded, so I'd
> like to report three issues, mainly regarding -cpus-per-proc.
>
> 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2
> sockets x 4 cores/socket), it starts to produce the error again, as shown
> below. At least openmpi-1.7.4a1r29646 worked well.
>
> [mishima_at_manage ~]$ qsub -I -l nodes=2:ppn=8
> qsub: waiting for job 8336.manage.cluster to start
> qsub: job 8336.manage.cluster ready
>
> [mishima_at_node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima_at_node11 demos]$ cat $PBS_NODEFILE
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> myprog
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
> Bind to: CORE
> Node: node12
> #processes: 2
> #cpus: 1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> Of course it works well using only one node.
>
> [mishima_at_node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings
> myprog
> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> Hello world from process 1 of 2
> Hello world from process 0 of 2
>
>
> 2) Adding "-bind-to numa", it works, but the message "bind:upward target
> NUMANode type NUMANode" appears.
> As far as I remember, I didn't see this kind of message before.
>
> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> -bind-to numa myprog
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> NUMANode
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> NUMANode
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> NUMANode
> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type
> NUMANode
> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 1 of 4
> Hello world from process 0 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
>
>
> 3) I use the PGI compiler. It cannot accept the compiler switch
> "-Wno-variadic-macros", which is
> included in the configure script.
>
> btl_usnic_CFLAGS="-Wno-variadic-macros"
>
> I removed this switch, and then I could continue building 1.7.4rc1.
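>
> Alternatively, I guess the usnic BTL could be skipped at configure time
> instead of editing the configure script (just a sketch, not tested here,
> and assuming that BTL is not needed on this cluster):
>
> ./configure --enable-mca-no-build=btl-usnic ...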
>
> Regards,
> Tetsuya Mishima
>
>
>> Hmmm...okay, I understand the scenario. Must be something in the algo
> when it only has one node, so it shouldn't be too hard to track down.
>>
>> I'm off on travel for a few days, but will return to this when I get
> back.
>>
>> Sorry for the delay - will try to look at this while I'm gone, but can't
> promise anything :-(
>>
>>
>> On Dec 10, 2013, at 6:58 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Hi Ralph, sorry for the confusion.
>>>
>>> We usually log on to "manage", which is our control node.
>>> From manage, we submit jobs or enter a remote node such as
>>> node03 via Torque interactive mode (qsub -I).
>>>
>>> This time, instead of Torque, I just did rsh to node03 from manage
>>> and ran myprog on that node. I hope this clarifies what I did.
>>>
>>> Now, I retried with "-host node03", which still causes the problem:
>>> (I confirmed that a local run on manage caused the same problem too)
>>>
>>> [mishima_at_manage ~]$ rsh node03
>>> Last login: Wed Dec 11 11:38:57 from manage
>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima_at_node03 demos]$
>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03 -report-bindings
>>> -cpus-per-proc 4 -map-by socket myprog
>>>
> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>> Bind to: CORE
>>> Node: node03
>>> #processes: 2
>>> #cpus: 1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>>
> --------------------------------------------------------------------------
>>>
>>> It's strange, but I have to report that "-map-by socket:span" worked
> well.
>>>
>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03 -report-bindings
>>> -cpus-per-proc 4 -map-by socket:span myprog
>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]],
> socket
>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
>>> ocket 1[core 11[hwt 0]]:
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]],
> socket
>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
>>> socket 1[core 15[hwt 0]]:
>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]],
> socket
>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
>>> socket 2[core 19[hwt 0]]:
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]],
> socket
>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
>>> socket 2[core 23[hwt 0]]:
>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]],
> socket
>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
>>> socket 3[core 27[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]],
> socket
>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
>>> socket 3[core 31[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> socket
>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>> cket 0[core 3[hwt 0]]:
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]],
> socket
>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
>>> cket 0[core 7[hwt 0]]:
>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>> Hello world from process 2 of 8
>>> Hello world from process 6 of 8
>>> Hello world from process 3 of 8
>>> Hello world from process 7 of 8
>>> Hello world from process 1 of 8
>>> Hello world from process 5 of 8
>>> Hello world from process 0 of 8
>>> Hello world from process 4 of 8
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>
>>>> On Dec 10, 2013, at 6:05 PM, tmishima_at_[hidden] wrote:
>>>>
>>>>>
>>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> I tried again with -cpus-per-proc 2 as shown below.
>>>>> Here, I found that "-map-by socket:span" worked well.
>>>>>
>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc
> 2
>>>>> -map-by socket:span myprog
>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]],
>>> socket
>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]],
>>> socket
>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
>>>>> /./././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]],
>>> socket
>>>>> 2[core 17[hwt 0]]: [./././././././.][./././.
>>>>> /./././.][B/B/./././././.][./././././././.]
>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]],
>>> socket
>>>>> 2[core 19[hwt 0]]: [./././././././.][./././.
>>>>> /./././.][././B/B/./././.][./././././././.]
>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]],
>>> socket
>>>>> 3[core 25[hwt 0]]: [./././././././.][./././.
>>>>> /./././.][./././././././.][B/B/./././././.]
>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]],
>>> socket
>>>>> 3[core 27[hwt 0]]: [./././././././.][./././.
>>>>> /./././.][./././././././.][././B/B/./././.]
>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>>> socket
>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]],
>>> socket
>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> Hello world from process 1 of 8
>>>>> Hello world from process 0 of 8
>>>>> Hello world from process 4 of 8
>>>>> Hello world from process 2 of 8
>>>>> Hello world from process 7 of 8
>>>>> Hello world from process 6 of 8
>>>>> Hello world from process 5 of 8
>>>>> Hello world from process 3 of 8
>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc
> 2
>>>>> -map-by socket myprog
>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]],
>>> socket
>>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]],
>>> socket
>>>>> 0[core 7[hwt 0]]: [././././././B/B][././././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]],
>>> socket
>>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]],
>>> socket
>>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B
>>>>> /./././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]],
>>> socket
>>>>> 1[core 13[hwt 0]]: [./././././././.][./././.
>>>>> /B/B/./.][./././././././.][./././././././.]
>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]],
>>> socket
>>>>> 1[core 15[hwt 0]]: [./././././././.][./././.
>>>>> /././B/B][./././././././.][./././././././.]
>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>>> socket
>>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]],
>>> socket
>>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././.
>>>>> /././.][./././././././.][./././././././.]
>>>>> Hello world from process 5 of 8
>>>>> Hello world from process 1 of 8
>>>>> Hello world from process 6 of 8
>>>>> Hello world from process 4 of 8
>>>>> Hello world from process 2 of 8
>>>>> Hello world from process 0 of 8
>>>>> Hello world from process 7 of 8
>>>>> Hello world from process 3 of 8
>>>>>
>>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets.
>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket" has
>>> same
>>>>> meaning.
>>>>> Therefore, there's no problem about that. Sorry for distubing.
>>>>
>>>> No problem - glad you could clear that up :-)
>>>>
>>>>>
>>>>> By the way, through this test, I found another problem.
>>>>> Without the Torque manager, just using rsh, it causes the same error
>>>>> as below:
>>>>>
>>>>> [mishima_at_manage openmpi-1.7]$ rsh node03
>>>>> Last login: Wed Dec 11 09:42:02 from manage
>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc
> 4
>>>>> -map-by socket myprog
>>>>
>>>> I don't understand the difference here - you are simply starting it from
>>>> a different node? It looks like everything is expected to run local to
>>>> mpirun, yes? So there is no rsh actually involved here.
>>>> Are you still running in an allocation?
>>>>
>>>> If you run this with "-host node03" on the cmd line, do you see the same
>>>> problem?
>>>>
>>>>
>>>>>
>>>
> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>> Bind to: CORE
>>>>> Node: node03
>>>>> #processes: 2
>>>>> #cpus: 1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>>
>>>
> --------------------------------------------------------------------------
>>>>> [mishima_at_node03 demos]$
>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc
> 4
>>>>> myprog
>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]],
>>> socket
>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
>>>>> ocket 1[core 11[hwt 0]]:
>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]],
>>> socket
>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
>>>>> socket 1[core 15[hwt 0]]:
>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]],
>>> socket
>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
>>>>> socket 2[core 19[hwt 0]]:
>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]],
>>> socket
>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
>>>>> socket 2[core 23[hwt 0]]:
>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]],
>>> socket
>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
>>>>> socket 3[core 27[hwt 0]]:
>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]],
>>> socket
>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
>>>>> socket 3[core 31[hwt 0]]:
>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>>> socket
>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>>>> cket 0[core 3[hwt 0]]:
>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]],
>>> socket
>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
>>>>> cket 0[core 7[hwt 0]]:
>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>> Hello world from process 4 of 8
>>>>> Hello world from process 2 of 8
>>>>> Hello world from process 6 of 8
>>>>> Hello world from process 5 of 8
>>>>> Hello world from process 3 of 8
>>>>> Hello world from process 7 of 8
>>>>> Hello world from process 0 of 8
>>>>> Hello world from process 1 of 8
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>
>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but let me
>>>>>> poke around a bit and see what might be happening.
>>>>>>
>>>>>> On Dec 10, 2013, at 4:47 PM, tmishima_at_[hidden] wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Ralph,
>>>>>>>
>>>>>>> Thanks. I didn't know the meaning of "socket:span".
>>>>>>>
>>>>>>> But it still causes the problem; it seems that socket:span doesn't
>>>>>>> work.
>>>>>>>
>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=node03:ppn=32
>>>>>>> qsub: waiting for job 8265.manage.cluster to start
>>>>>>> qsub: job 8265.manage.cluster ready
>>>>>>>
>>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings
> -cpus-per-proc
>>> 4
>>>>>>> -map-by socket:span myprog
>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]],
>>>>> socket
>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
>>>>>>> ocket 1[core 11[hwt 0]]:
>>>>>>>
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt
> 0]],
>>>>> socket
>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
>>>>>>> socket 1[core 15[hwt 0]]:
>>>>>>>
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt
> 0]],
>>>>> socket
>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
>>>>>>> socket 2[core 19[hwt 0]]:
>>>>>>>
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt
> 0]],
>>>>> socket
>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
>>>>>>> socket 2[core 23[hwt 0]]:
>>>>>>>
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt
> 0]],
>>>>> socket
>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
>>>>>>> socket 3[core 27[hwt 0]]:
>>>>>>>
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt
> 0]],
>>>>> socket
>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
>>>>>>> socket 3[core 31[hwt 0]]:
>>>>>>>
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>>>>> socket
>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>>>>>> cket 0[core 3[hwt 0]]:
>>>>>>>
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]],
>>>>> socket
>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
>>>>>>> cket 0[core 7[hwt 0]]:
>>>>>>>
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>> Hello world from process 0 of 8
>>>>>>> Hello world from process 3 of 8
>>>>>>> Hello world from process 1 of 8
>>>>>>> Hello world from process 4 of 8
>>>>>>> Hello world from process 6 of 8
>>>>>>> Hello world from process 5 of 8
>>>>>>> Hello world from process 2 of 8
>>>>>>> Hello world from process 7 of 8
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>>> No, that is actually correct. We map a socket until full, then move
>>>>>>>> to the next. What you want is --map-by socket:span
>>>>>>>>
>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmishima_at_[hidden] wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Ralph,
>>>>>>>>>
>>>>>>>>> I had time to try your patch yesterday using
>>> openmpi-1.7.4a1r29646.
>>>>>>>>>>>>>>>> It stopped the error but unfortunately "mapping by
> socket" itself
>>>>>>> didn't
>>>>>>>>> work
>>>>>>>>> well, as shown below:
>>>>>>>>>
>>>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
>>>>>>>>> qsub: job 8260.manage.cluster ready
>>>>>>>>>
>>>>>>>>> [mishima_at_node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>> [mishima_at_node04 demos]$ mpirun -np 8 -report-bindings
>>> -cpus-per-proc
>>>>> 4
>>>>>>>>> -map-by socket myprog
>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt
> 0]],
>>>>>>> socket
>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
>>>>>>>>> ocket 1[core 11[hwt 0]]:
>>>>>>>>>
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt
>>> 0]],
>>>>>>> socket
>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
>>>>>>>>> socket 1[core 15[hwt 0]]:
>>>>>>>>>
>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt
>>> 0]],
>>>>>>> socket
>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
>>>>>>>>> socket 2[core 19[hwt 0]]:
>>>>>>>>>
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt
>>> 0]],
>>>>>>> socket
>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
>>>>>>>>> socket 2[core 23[hwt 0]]:
>>>>>>>>>
>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt
>>> 0]],
>>>>>>> socket
>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
>>>>>>>>> socket 3[core 27[hwt 0]]:
>>>>>>>>>
>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt
>>> 0]],
>>>>>>> socket
>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
>>>>>>>>> socket 3[core 31[hwt 0]]:
>>>>>>>>>
>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt
> 0]],
>>>>>>> socket
>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>>>>>>>> cket 0[core 3[hwt 0]]:
>>>>>>>>>
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt
> 0]],
>>>>>>> socket
>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
>>>>>>>>> cket 0[core 7[hwt 0]]:
>>>>>>>>>
>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>
>>>>>>>>> I think this should be like this:
>>>>>>>>>
>>>>>>>>> rank 00
>>>>>>>>>
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>> rank 01
>>>>>>>>>
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>> rank 02
>>>>>>>>>
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Tetsuya Mishima
>>>>>>>>>
>>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and
>>>>> have
>>>>>>>>> scheduled it for 1.7.4.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmishima_at_[hidden] wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for your quick response.
>>>>>>>>>>>
>>>>>>>>>>> I'm afraid to say that I found one more issue...
>>>>>>>>>>>
>>>>>>>>>>> It's not so serious. Please check it when you have a lot of
> time.
>>>>>>>>>>>
>>>>>>>>>>> The problem is -cpus-per-proc with the -map-by option under the
>>>>>>>>>>> Torque manager. It doesn't work as shown below. I guess you can
>>>>>>>>>>> get the same behaviour under the Slurm manager.
>>>>>>>>>>>
>>>>>>>>>>> Of course, if I remove -map-by option, it works quite well.
>>>>>>>>>>>
>>>>>>>>>>> [mishima_at_manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>>>>>>>> qsub: job 8116.manage.cluster ready
>>>>>>>>>>>
>>>>>>>>>>> [mishima_at_node03 ~]$ cd ~/Ducom/testbed2
>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings
>>>>>>> -cpus-per-proc
>>>>>>>>> 4
>>>>>>>>>>> -map-by socket mPre
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> --------------------------------------------------------------------------
>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>
>>>>>>>>>>> Bind to: CORE
>>>>>>>>>>> Node: node03
>>>>>>>>>>> #processes: 2
>>>>>>>>>>> #cpus: 1
>>>>>>>>>>>
>>>>>>>>>>> You can override this protection by adding the
> "overload-allowed"
>>>>>>>>>>> option to your binding directive.
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings
>>>>>>> -cpus-per-proc
>>>>>>>>> 4
>>>>>>>>>>> mPre
>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt
>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s
>>>>>>>>>>> ocket 1[core 11[hwt 0]]:
>>>>>>>>>>>
>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt
>>>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]],
>>>>>>>>>>> socket 1[core 15[hwt 0]]:
>>>>>>>>>>>
>>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt
>>>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]],
>>>>>>>>>>> socket 2[core 19[hwt 0]]:
>>>>>>>>>>>
>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt
>>>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]],
>>>>>>>>>>> socket 2[core 23[hwt 0]]:
>>>>>>>>>>>
>>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt
>>>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]],
>>>>>>>>>>> socket 3[core 27[hwt 0]]:
>>>>>>>>>>>
>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt
>>>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]],
>>>>>>>>>>> socket 3[core 31[hwt 0]]:
>>>>>>>>>>>
>>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt
>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>>>>>>>>>> cket 0[core 3[hwt 0]]:
>>>>>>>>>>>
>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt
>>> 0]],
>>>>>>>>> socket
>>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
>>>>>>>>>>> cket 0[core 7[hwt 0]]:
>>>>>>>>>>>
>>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>
>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc_at_[hidden]>
>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had
>>>>>>> time :-)
>>>>>>>>>>>>
>>>>>>>>>>>> I'll update tomorrow.
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM,
>>>>> <tmishima_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> This is the continuation of "Segmentation fault in oob_tcp.c of
>>>>>>>>>>>> openmpi-1.7.4a1r29646".
>>>>>>>>>>>>
>>>>>>>>>>>> I found the cause.
>>>>>>>>>>>>
>>>>>>>>>>>> Firstly, I noticed that your hostfile works and mine does not.
>>>>>>>>>>>>
>>>>>>>>>>>> Your host file:
>>>>>>>>>>>> cat hosts
>>>>>>>>>>>> bend001 slots=12
>>>>>>>>>>>>
>>>>>>>>>>>> My host file:
>>>>>>>>>>>> cat hosts
>>>>>>>>>>>> node08
>>>>>>>>>>>> node08
>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>
>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my
>>>>>>> hostfile
>>>>>>>>>>>> just before launching mpirun. Then it worked.
>>>>>>>>>>>>
>>>>>>>>>>>> My host file (modified):
>>>>>>>>>>>> cat hosts
>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>> ...(total 8 lines)
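>>>>>>>>>>>>
>>>>>>>>>>>> Presumably the condensed one-line form, matching the style of your
>>>>>>>>>>>> working hostfile, would also do (a sketch, not tested here):
>>>>>>>>>>>>
>>>>>>>>>>>> node08 slots=8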
>>>>>>>>>>>>
>>>>>>>>>>>> Secondly, I confirmed that there's a slight difference between
>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>>>>>>>>>
>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>>>>>>>> 394,401c394,399
>>>>>>>>>>>> < if (got_count) {
>>>>>>>>>>>> < node->slots_given = true;
>>>>>>>>>>>> < } else if (got_max) {
>>>>>>>>>>>> < node->slots = node->slots_max;
>>>>>>>>>>>> < node->slots_given = true;
>>>>>>>>>>>> < } else {
>>>>>>>>>>>> < /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>> < node->slots_given = false;
>>>>>>>>>>>> ---
>>>>>>>>>>>>> if (!got_count) {
>>>>>>>>>>>>> if (got_max) {
>>>>>>>>>>>>> node->slots = node->slots_max;
>>>>>>>>>>>>> } else {
>>>>>>>>>>>>> ++node->slots;
>>>>>>>>>>>>> }
>>>>>>>>>>>> ....
>>>>>>>>>>>>
>>>>>>>>>>>> Finally, I added the line 402 below just as a tentative trial.
>>>>>>>>>>>> Then, it worked.
>>>>>>>>>>>>
>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>>>>>>> ...
>>>>>>>>>>>> 394 if (got_count) {
>>>>>>>>>>>> 395 node->slots_given = true;
>>>>>>>>>>>> 396 } else if (got_max) {
>>>>>>>>>>>> 397 node->slots = node->slots_max;
>>>>>>>>>>>> 398 node->slots_given = true;
>>>>>>>>>>>> 399 } else {
>>>>>>>>>>>> 400 /* should be set by obj_new, but just to be clear
>>> */
>>>>>>>>>>>> 401 node->slots_given = false;
>>>>>>>>>>>> 402 ++node->slots; /* added by tmishima */
>>>>>>>>>>>> 403 }
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> Please fix the problem properly, because this is just based on my
>>>>>>>>>>>> rough guess. It's related to the treatment of a hostfile where
>>>>>>>>>>>> slot information is not given.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>