Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: tmishima_at_[hidden]
Date: 2013-12-18 20:15:24


Yes, it's very strange. But I don't think there's any chance that
I have < 8 actual cores on the node. I guess that you can replicate
it with SLURM, so please try it again.

I switched to node10 and node11, and then I got the warning for
node11.

Furthermore, just for your information, I tried adding
"-bind-to core:overload-allowed", and then it worked as shown below.
But I don't think node11 is actually overloaded, because it has 8 cores.
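
For reference, the arithmetic behind that expectation can be sketched in Python. This is only a rough check under stated assumptions, not how Open MPI's mapper actually decides: the helper names are hypothetical, the nodefile text is a stand-in for the contents of $PBS_NODEFILE, and it assumes the 4 processes are spread evenly, 2 per node.

```python
from collections import Counter


def slots_per_node(nodefile_text):
    """Count how many times each host appears in a PBS-style nodefile."""
    return Counter(
        line.strip() for line in nodefile_text.splitlines() if line.strip()
    )


def short_nodes(slots, procs_per_node, cpus_per_proc):
    """Return nodes with fewer slots than procs_per_node * cpus_per_proc."""
    needed = procs_per_node * cpus_per_proc
    return sorted(node for node, count in slots.items() if count < needed)


# Stand-in for $PBS_NODEFILE: 8 slots on each of the two nodes (assumption).
nodefile = "node10\n" * 8 + "node11\n" * 8
slots = slots_per_node(nodefile)

# 2 procs per node x 4 cpus per proc = 8 cpus, which fits in 8 slots.
print(short_nodes(slots, procs_per_node=2, cpus_per_proc=4))
```

With 8 slots per node the list is empty, so by this count neither node should need "overload-allowed". A nodefile listing only 7 entries for one node would put that node in the list.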

qsub: job 8342.manage.cluster completed
[mishima_at_manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
qsub: waiting for job 8343.manage.cluster to start
qsub: job 8343.manage.cluster ready

[mishima_at_node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima_at_node10 demos]$ cat $PBS_NODEFILE
node10
node10
node10
node10
node10
node10
node10
node10
node11
node11
node11
node11
node11
node11
node11
node11
[mishima_at_node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node: node11
   #processes: 2
   #cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
[mishima_at_node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to core:overload-allowed myprog
[node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
Hello world from process 1 of 4
Hello world from process 0 of 4
Hello world from process 3 of 4
Hello world from process 2 of 4
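
To compare bindings like the ones above without reading the masks by eye, a small Python helper can turn a -report-bindings mask into a list of core indices. This is a minimal sketch with a hypothetical function name. It assumes each [...] group is one socket and each "/"-separated entry is one core, which matches the output in this thread but may not cover every mask layout.

```python
def bound_cores(mask):
    """Return node-wide indices of cores marked with 'B' in a bindings mask.

    Assumes each [...] group is one socket and each '/'-separated entry
    is one core, numbered left to right across the whole node.
    """
    cores = []
    index = 0
    for socket in mask.strip("[]").split("]["):
        for entry in socket.split("/"):
            if "B" in entry:
                cores.append(index)
            index += 1
    return cores


print(bound_cores("[B/B/B/B][./././.]"))  # [0, 1, 2, 3]
print(bound_cores("[./././.][B/B/B/B]"))  # [4, 5, 6, 7]
```

Applied to the four ranks above, each rank gets four distinct cores on its node, and no core index is shared between the two ranks on the same node.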

Regards,
Tetsuya Mishima

> Very strange - I can't seem to replicate it. Is there any chance that you
> have < 8 actual cores on node12?
>
>
> On Dec 18, 2013, at 4:53 PM, tmishima_at_[hidden] wrote:
>
> >
> >
> > Hi Ralph, sorry for confusing you.
> >
> > At that time, I cut and pasted part of the "cat $PBS_NODEFILE" output.
> > I guess I left out the last line by mistake.
> >
> > I retried the test, and below is exactly what I got.
> >
> > [mishima_at_manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
> > qsub: waiting for job 8338.manage.cluster to start
> > qsub: job 8338.manage.cluster ready
> >
> > [mishima_at_node11 ~]$ cat $PBS_NODEFILE
> > node11
> > node11
> > node11
> > node11
> > node11
> > node11
> > node11
> > node11
> > node12
> > node12
> > node12
> > node12
> > node12
> > node12
> > node12
> > node12
> > [mishima_at_node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> > --------------------------------------------------------------------------
> > A request was made to bind to that would result in binding more
> > processes than cpus on a resource:
> >
> > Bind to: CORE
> > Node: node12
> > #processes: 2
> > #cpus: 1
> >
> > You can override this protection by adding the "overload-allowed"
> > option to your binding directive.
> > --------------------------------------------------------------------------
> >
> > Regards,
> >
> > Tetsuya Mishima
> >
> >> I removed the debug in #2 - thanks for reporting it
> >>
> >> For #1, it actually looks to me like this is correct. If you look at your
> >> allocation, there are only 7 slots being allocated on node12, yet you have
> >> asked for 8 cpus to be assigned (2 procs with 4 cpus/proc). So the warning
> >> is in fact correct.
> >>
> >>
> >> On Dec 18, 2013, at 4:04 PM, tmishima_at_[hidden] wrote:
> >>
> >>>
> >>>
> >>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd like
> >>> to report 3 issues, mainly regarding -cpus-per-proc.
> >>>
> >>> 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2 sockets x
> >>> 4 cores/socket), it starts to produce the error again as shown below.
> >>> At least openmpi-1.7.4a1r29646 did work well.
> >>>
> >>> [mishima_at_manage ~]$ qsub -I -l nodes=2:ppn=8
> >>> qsub: waiting for job 8336.manage.cluster to start
> >>> qsub: job 8336.manage.cluster ready
> >>>
> >>> [mishima_at_node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>> [mishima_at_node11 demos]$ cat $PBS_NODEFILE
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node11
> >>> node12
> >>> node12
> >>> node12
> >>> node12
> >>> node12
> >>> node12
> >>> node12
> >>> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> >>> --------------------------------------------------------------------------
> >>> A request was made to bind to that would result in binding more
> >>> processes than cpus on a resource:
> >>>
> >>> Bind to: CORE
> >>> Node: node12
> >>> #processes: 2
> >>> #cpus: 1
> >>>
> >>> You can override this protection by adding the "overload-allowed"
> >>> option to your binding directive.
> >>> --------------------------------------------------------------------------
> >>>
> >>> Of course it works well using only one node.
> >>>
> >>> [mishima_at_node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
> >>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> >>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> >>> Hello world from process 1 of 2
> >>> Hello world from process 0 of 2
> >>>
> >>>
> >>> 2) With "-bind-to numa" added, it works, but the message "bind:upward target
> >>> NUMANode type NUMANode" appears.
> >>> As far as I remember, I didn't see this kind of message before.
> >>>
> >>> [mishima_at_node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to numa myprog
> >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> >>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
> >>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> >>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> >>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> >>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> >>> Hello world from process 1 of 4
> >>> Hello world from process 0 of 4
> >>> Hello world from process 3 of 4
> >>> Hello world from process 2 of 4
> >>>
> >>>
> >>> 3) I use the PGI compiler. It cannot accept the compiler switch
> >>> "-Wno-variadic-macros", which is included in the configure script.
> >>>
> >>> btl_usnic_CFLAGS="-Wno-variadic-macros"
> >>>
> >>> After I removed this switch, I could continue building 1.7.4rc1.
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>>
> >>>> Hmmm...okay, I understand the scenario. Must be something in the algo
> >>>> when it only has one node, so it shouldn't be too hard to track down.
> >>>>
> >>>> I'm off on travel for a few days, but will return to this when I get back.
> >>>>
> >>>> Sorry for delay - will try to look at this while I'm gone, but can't
> >>>> promise anything :-(
> >>>>
> >>>>
> >>>> On Dec 10, 2013, at 6:58 PM, tmishima_at_[hidden] wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> Hi Ralph, sorry for the confusion.
> >>>>>
> >>>>> We usually log on to "manage", which is our control node.
> >>>>> From manage, we submit jobs or enter a remote node such as
> >>>>> node03 using Torque interactive mode (qsub -I).
> >>>>>
> >>>>> At that time, instead of using Torque, I just did rsh to node03 from manage
> >>>>> and ran myprog on the node. I hope that makes clear what I did.
> >>>>>
> >>>>> Now, I retried with "-host node03", which still causes the problem
> >>>>> (I confirmed that a local run on manage caused the same problem too):
> >>>>>
> >>>>> [mishima_at_manage ~]$ rsh node03
> >>>>> Last login: Wed Dec 11 11:38:57 from manage
> >>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>> [mishima_at_node03 demos]$
> >>>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> >>>>> --------------------------------------------------------------------------
> >>>>> A request was made to bind to that would result in binding more
> >>>>> processes than cpus on a resource:
> >>>>>
> >>>>> Bind to: CORE
> >>>>> Node: node03
> >>>>> #processes: 2
> >>>>> #cpus: 1
> >>>>>
> >>>>> You can override this protection by adding the "overload-allowed"
> >>>>> option to your binding directive.
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> It's strange, but I have to report that "-map-by socket:span" worked well.
> >>>>>
> >>>>> [mishima_at_node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> >>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>> Hello world from process 2 of 8
> >>>>> Hello world from process 6 of 8
> >>>>> Hello world from process 3 of 8
> >>>>> Hello world from process 7 of 8
> >>>>> Hello world from process 1 of 8
> >>>>> Hello world from process 5 of 8
> >>>>> Hello world from process 0 of 8
> >>>>> Hello world from process 4 of 8
> >>>>>
> >>>>> Regards,
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>>
> >>>>>> On Dec 10, 2013, at 6:05 PM, tmishima_at_[hidden] wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Ralph,
> >>>>>>>
> >>>>>>> I tried again with -cpus-per-proc 2 as shown below.
> >>>>>>> Here, I found that "-map-by socket:span" worked well.
> >>>>>>>
> >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
> >>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
> >>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
> >>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
> >>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
> >>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>> Hello world from process 1 of 8
> >>>>>>> Hello world from process 0 of 8
> >>>>>>> Hello world from process 4 of 8
> >>>>>>> Hello world from process 2 of 8
> >>>>>>> Hello world from process 7 of 8
> >>>>>>> Hello world from process 6 of 8
> >>>>>>> Hello world from process 5 of 8
> >>>>>>> Hello world from process 3 of 8
> >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
> >>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>> Hello world from process 5 of 8
> >>>>>>> Hello world from process 1 of 8
> >>>>>>> Hello world from process 6 of 8
> >>>>>>> Hello world from process 4 of 8
> >>>>>>> Hello world from process 2 of 8
> >>>>>>> Hello world from process 0 of 8
> >>>>>>> Hello world from process 7 of 8
> >>>>>>> Hello world from process 3 of 8
> >>>>>>>
> >>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all the sockets.
> >>>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket" have
> >>>>>>> the same meaning.
> >>>>>>> Therefore, there's no problem with that. Sorry for disturbing you.
> >>>>>>
> >>>>>> No problem - glad you could clear that up :-)
> >>>>>>
> >>>>>>>
> >>>>>>> By the way, through this test, I found another problem.
> >>>>>>> Without the Torque manager and just using rsh, it causes the same error,
> >>>>>>> as shown below:
> >>>>>>>
> >>>>>>> [mishima_at_manage openmpi-1.7]$ rsh node03
> >>>>>>> Last login: Wed Dec 11 09:42:02 from manage
> >>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> >>>>>>
> >>>>>> I don't understand the difference here - you are simply starting it
> >>>>>> from a different node? It looks like everything is expected to run local
> >>>>>> to mpirun, yes? So there is no rsh actually involved here.
> >>>>>> Are you still running in an allocation?
> >>>>>>
> >>>>>> If you run this with "-host node03" on the cmd line, do you see the
> >>>>>> same problem?
> >>>>>>
> >>>>>>
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> A request was made to bind to that would result in binding more
> >>>>>>> processes than cpus on a resource:
> >>>>>>>
> >>>>>>> Bind to: CORE
> >>>>>>> Node: node03
> >>>>>>> #processes: 2
> >>>>>>> #cpus: 1
> >>>>>>>
> >>>>>>> You can override this protection by adding the "overload-allowed"
> >>>>>>> option to your binding directive.
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> [mishima_at_node03 demos]$
> >>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
> >>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>> Hello world from process 4 of 8
> >>>>>>> Hello world from process 2 of 8
> >>>>>>> Hello world from process 6 of 8
> >>>>>>> Hello world from process 5 of 8
> >>>>>>> Hello world from process 3 of 8
> >>>>>>> Hello world from process 7 of 8
> >>>>>>> Hello world from process 0 of 8
> >>>>>>> Hello world from process 1 of 8
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Tetsuya Mishima
> >>>>>>>
> >>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but let me
> >>>>>>>> poke around a bit and see what might be happening.
> >>>>>>>>
> >>>>>>>> On Dec 10, 2013, at 4:47 PM, tmishima_at_[hidden] wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi Ralph,
> >>>>>>>>>
> >>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
> >>>>>>>>>
> >>>>>>>>> But it still causes the problem; it seems that socket:span doesn't
> >>>>>>>>> work.
> >>>>>>>>>
> >>>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=node03:ppn=32
> >>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
> >>>>>>>>> qsub: job 8265.manage.cluster ready
> >>>>>>>>>
> >>>>>>>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> >>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Tetsuya Mishima
> >>>>>>>>>
> >>>>>>>>>> No, that is actually correct. We map a socket until full, then move to
> >>>>>>>>>> the next. What you want is --map-by socket:span
> >>>>>>>>>>
> >>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmishima_at_[hidden] wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>
> >>>>>>>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
> >>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket" itself
> >>>>>>>>>>> didn't work well, as shown below:
> >>>>>>>>>>>
> >>>>>>>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=1:ppn=32
> >>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
> >>>>>>>>>>> qsub: job 8260.manage.cluster ready
> >>>>>>>>>>>
> >>>>>>>>>>> [mishima_at_node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>>>>>> [mishima_at_node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> Hello world from process 2 of 8
> >>>>>>>>>>> Hello world from process 1 of 8
> >>>>>>>>>>> Hello world from process 3 of 8
> >>>>>>>>>>> Hello world from process 0 of 8
> >>>>>>>>>>> Hello world from process 6 of 8
> >>>>>>>>>>> Hello world from process 5 of 8
> >>>>>>>>>>> Hello world from process 4 of 8
> >>>>>>>>>>> Hello world from process 7 of 8
> >>>>>>>>>>>
> >>>>>>>>>>> I think this should be like this:
> >>>>>>>>>>>
> >>>>>>>>>>> rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>> rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>> rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>
> >>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and have
> >>>>>>>>>>>> scheduled it for 1.7.4.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks!
> >>>>>>>>>>>> Ralph
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmishima_at_[hidden] wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you very much for your quick response.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm afraid to say that I found one more issue...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It's not so serious. Please check it when you have time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The problem is -cpus-per-proc with the -map-by option under the Torque manager.
> >>>>>>>>>>>>> It doesn't work, as shown below. I guess you can get the same
> >>>>>>>>>>>>> behaviour under the Slurm manager.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [mishima_at_manage testbed2]$ qsub -I -l nodes=1:ppn=32
> >>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
> >>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [mishima_at_node03 ~]$ cd ~/Ducom/testbed2
> >>>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> >>>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>>> A request was made to bind to that would result in binding more
> >>>>>>>>>>>>> processes than cpus on a resource:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Bind to: CORE
> >>>>>>>>>>>>> Node: node03
> >>>>>>>>>>>>> #processes: 2
> >>>>>>>>>>>>> #cpus: 1
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
> >>>>>>>>>>>>> option to your binding directive.
> >>>>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Tetsuya Mishima
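
As a rough sanity check on the report above: the binding map shows 4 sockets of 8 cores each (32 cores), and the job asks for 8 processes of 4 cores each, so the request fits exactly and no overload should be needed. The sketch below only does that arithmetic. The function name `fits_without_overload` is hypothetical and is not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): succeed when np processes,
# each bound to cpp cores, fit on a node with the given number of cores.
fits_without_overload() {
    np=$1; cpp=$2; cores=$3
    [ $((np * cpp)) -le "$cores" ]
}

# Values taken from the run above: -np 8, -cpus-per-proc 4,
# and 4 sockets x 8 cores as shown in the binding map.
if fits_without_overload 8 4 32; then
    echo "request fits on the node"
else
    echo "request would overload the node"
fi
```

Since the request fits, the "more processes than cpus" message with -map-by socket looks like a mapping problem rather than a real shortage of cores.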
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'll update tomorrow.
> >>>>>>>>>>>>>> Ralph
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmishima_at_[hidden]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is the continuation of "Segmentation fault in oob_tcp.c of
> >>>>>>>>>>>>>> openmpi-1.7.4a1r29646".
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I found the cause.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> First, I noticed that your hostfile works and mine does not.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Your host file:
> >>>>>>>>>>>>>> cat hosts
> >>>>>>>>>>>>>> bend001 slots=12
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My host file:
> >>>>>>>>>>>>>> cat hosts
> >>>>>>>>>>>>>> node08
> >>>>>>>>>>>>>> node08
> >>>>>>>>>>>>>> ...(total 8 lines)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my
> >>>>>>>>>>>>>> hostfile just before launching mpirun. Then it worked.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My host file(modified):
> >>>>>>>>>>>>>> cat hosts
> >>>>>>>>>>>>>> node08 slots=1
> >>>>>>>>>>>>>> node08 slots=1
> >>>>>>>>>>>>>> ...(total 8 lines)
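
A minimal sketch of that workaround, assuming the hostfile has one host name per line. The function name `add_slots` and the sample contents are made up for illustration; the actual script used here is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the hostfile given as $1 with " slots=1"
# appended to every non-empty line that has no slots= setting yet.
add_slots() {
    awk 'NF == 0  { next }            # skip blank lines
         /slots=/ { print; next }     # leave explicit slot counts alone
                  { print $0 " slots=1" }' "$1"
}

# Demo with a stand-in hostfile (contents are illustrative only).
hosts=$(mktemp)
printf 'node08\nnode08\nnode08\n' > "$hosts"
add_slots "$hosts"
rm -f "$hosts"
```

The output can be redirected to a new file and that file passed to mpirun with -hostfile.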
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Second, I confirmed that there's a slight difference between
> >>>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> >>>>>>>>>>>>>> 394,401c394,399
> >>>>>>>>>>>>>> < if (got_count) {
> >>>>>>>>>>>>>> <     node->slots_given = true;
> >>>>>>>>>>>>>> < } else if (got_max) {
> >>>>>>>>>>>>>> <     node->slots = node->slots_max;
> >>>>>>>>>>>>>> <     node->slots_given = true;
> >>>>>>>>>>>>>> < } else {
> >>>>>>>>>>>>>> <     /* should be set by obj_new, but just to be clear */
> >>>>>>>>>>>>>> <     node->slots_given = false;
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> > if (!got_count) {
> >>>>>>>>>>>>>> >     if (got_max) {
> >>>>>>>>>>>>>> >         node->slots = node->slots_max;
> >>>>>>>>>>>>>> >     } else {
> >>>>>>>>>>>>>> >         ++node->slots;
> >>>>>>>>>>>>>> >     }
> >>>>>>>>>>>>>> ....
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Finally, I added line 402 below as a tentative trial.
> >>>>>>>>>>>>>> Then it worked.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
> >>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>> 394  if (got_count) {
> >>>>>>>>>>>>>> 395      node->slots_given = true;
> >>>>>>>>>>>>>> 396  } else if (got_max) {
> >>>>>>>>>>>>>> 397      node->slots = node->slots_max;
> >>>>>>>>>>>>>> 398      node->slots_given = true;
> >>>>>>>>>>>>>> 399  } else {
> >>>>>>>>>>>>>> 400      /* should be set by obj_new, but just to be clear */
> >>>>>>>>>>>>>> 401      node->slots_given = false;
> >>>>>>>>>>>>>> 402      ++node->slots; /* added by tmishima */
> >>>>>>>>>>>>>> 403  }
> >>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please fix the problem properly, because my change is only a
> >>>>>>>>>>>>>> guess. The problem is related to the treatment of a hostfile
> >>>>>>>>>>>>>> where slots information is not given.
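
To illustrate the accounting that the `++node->slots` line restores (each hostfile line without a slot count adds one slot to that host), here is a small sketch that collapses repeated host names into per-host counts. It only mirrors the intended bookkeeping as read from the diff above and is not how ORTE parses hostfiles. The function name `count_slots` and the sample host names are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read a hostfile ($1) with one bare host name per line
# and print "host slots=N", where N is how many times the host was listed.
# Hosts are printed in the order they first appear.
count_slots() {
    awk 'NF > 0 {
             if (!($1 in n)) order[++k] = $1   # remember first-seen order
             n[$1]++                           # one more slot for this host
         }
         END {
             for (i = 1; i <= k; i++) print order[i] " slots=" n[order[i]]
         }' "$1"
}

# Demo with a stand-in hostfile (contents are illustrative only).
hosts=$(mktemp)
printf 'node08\nnode08\nnode08\nnode09\n' > "$hosts"
count_slots "$hosts"
rm -f "$hosts"
```

With the unpatched 1.7.4a1r29646 code shown above, a host listed this way is never counted in the final else branch, which is why the repeated-line hostfile behaved differently from one with explicit slots= values.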
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>> users mailing list
> >>>>>>>>>>>>>> users_at_[hidden]
> >>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
