Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-12-10 21:28:19


On Dec 10, 2013, at 6:05 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Ralph,
>
> I tried again with -cpus-per-proc 2 as shown below.
> Here, I found that "-map-by socket:span" worked well.
>
> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> Hello world from process 1 of 8
> Hello world from process 0 of 8
> Hello world from process 4 of 8
> Hello world from process 2 of 8
> Hello world from process 7 of 8
> Hello world from process 6 of 8
> Hello world from process 5 of 8
> Hello world from process 3 of 8
> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> Hello world from process 5 of 8
> Hello world from process 1 of 8
> Hello world from process 6 of 8
> Hello world from process 4 of 8
> Hello world from process 2 of 8
> Hello world from process 0 of 8
> Hello world from process 7 of 8
> Hello world from process 3 of 8
>
> With "-np 8" and "-cpus-per-proc 4", the 8 x 4 = 32 requested cores just filled all four sockets.
> In that case, I guess "-map-by socket:span" and "-map-by socket" give the same
> mapping.
> Therefore, there's no problem there. Sorry for disturbing you.

No problem - glad you could clear that up :-)
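
For anyone following the thread, here is a rough sketch of why the two options coincide at 4 cpus per process but differ at 2. It is a toy Python model that only reproduces the bindings reported in this thread for one node with 4 sockets of 8 cores. It is not the actual mapper code, and the function name `map_by_socket`, its parameters, and the balancing rule used for ":span" are assumptions made for illustration.

```python
def map_by_socket(nprocs, cpus_per_proc, span=False, sockets=4, cores_per_socket=8):
    """Toy model: return {rank: (socket, [core ids])} for a single node."""
    # How many ranks fit on one socket when each rank takes cpus_per_proc cores.
    capacity = cores_per_socket // cpus_per_proc
    if capacity == 0:
        raise ValueError("cpus_per_proc exceeds the cores in one socket")
    if span:
        # Assumed ":span" behaviour: balance ranks over all sockets
        # (ceiling division), never exceeding what one socket can hold.
        per_socket = min(capacity, -(-nprocs // sockets))
    else:
        # Plain "socket": fill a socket completely, then move to the next.
        per_socket = capacity
    placement = {}
    for rank in range(nprocs):
        socket, slot = divmod(rank, per_socket)
        if socket >= sockets:
            raise ValueError("not enough cores on this node")
        first = socket * cores_per_socket + slot * cpus_per_proc
        placement[rank] = (socket, list(range(first, first + cpus_per_proc)))
    return placement


if __name__ == "__main__":
    for cpus in (2, 4):
        plain = map_by_socket(8, cpus)
        spanned = map_by_socket(8, cpus, span=True)
        print(cpus, "cpus per proc, same placement:", plain == spanned)
```

With 4 cpus per process only two ranks fit on a socket, which is already the balanced share of 8 ranks over 4 sockets, so both modes give the same placement in this model. With 2 cpus per process, plain "socket" packs ranks 0-3 onto socket 0 while ":span" puts two ranks on each socket.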

>
> By the way, while running this test, I found another problem.
> Without the Torque manager, just using rsh, I get the same error as before, as shown
> below:
>
> [mishima_at_manage openmpi-1.7]$ rsh node03
> Last login: Wed Dec 11 09:42:02 from manage
> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog

I don't understand the difference here - you are simply starting it from a different node? It looks like everything is expected to run local to mpirun, yes? So there is no rsh actually involved here. Are you still running in an allocation?

If you run this with "-host node03" on the cmd line, do you see the same problem?

> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
> Bind to: CORE
> Node: node03
> #processes: 2
> #cpus: 1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
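
Read literally, the message above says the binder refuses to place 2 processes on a resource that has only 1 cpu unless overloading is explicitly allowed. A minimal Python sketch of that kind of guard follows. It is purely illustrative: the names `check_binding` and `BindingOverloadError` are hypothetical, the `overload_allowed` flag is only an assumed analogue of the "overload-allowed" option, and none of this is the ORTE code.

```python
class BindingOverloadError(RuntimeError):
    """Raised when more processes than cpus would be bound to one resource."""


def check_binding(node, bind_to, nprocs, ncpus, overload_allowed=False):
    """Return True if the binding is acceptable, otherwise raise."""
    if nprocs > ncpus and not overload_allowed:
        raise BindingOverloadError(
            f"Bind to: {bind_to}, Node: {node}, "
            f"#processes: {nprocs}, #cpus: {ncpus}"
        )
    return True


if __name__ == "__main__":
    try:
        # The situation reported in the error box: 2 processes, 1 cpu.
        check_binding("node03", "CORE", nprocs=2, ncpus=1)
    except BindingOverloadError as err:
        print("rejected:", err)
    # Assumed analogue of adding "overload-allowed" to the binding directive.
    print(check_binding("node03", "CORE", nprocs=2, ncpus=1, overload_allowed=True))
```
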
> [mishima_at_node03 demos]$
> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> Hello world from process 4 of 8
> Hello world from process 2 of 8
> Hello world from process 6 of 8
> Hello world from process 5 of 8
> Hello world from process 3 of 8
> Hello world from process 7 of 8
> Hello world from process 0 of 8
> Hello world from process 1 of 8
>
> Regards,
> Tetsuya Mishima
>
>> Hmmm...that's strange. I only have 2 sockets on my system, but let me
>> poke around a bit and see what might be happening.
>>
>> On Dec 10, 2013, at 4:47 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Hi Ralph,
>>>
>>> Thanks. I didn't know the meaning of "socket:span".
>>>
>>> But the problem remains; it seems that socket:span doesn't work.
>>>
>>> [mishima_at_manage demos]$ qsub -I -l nodes=node03:ppn=32
>>> qsub: waiting for job 8265.manage.cluster to start
>>> qsub: job 8265.manage.cluster ready
>>>
>>> [mishima_at_node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima_at_node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>> Hello world from process 0 of 8
>>> Hello world from process 3 of 8
>>> Hello world from process 1 of 8
>>> Hello world from process 4 of 8
>>> Hello world from process 6 of 8
>>> Hello world from process 5 of 8
>>> Hello world from process 2 of 8
>>> Hello world from process 7 of 8
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> No, that is actually correct. We map a socket until full, then move to
>>>> the next. What you want is --map-by socket:span
>>>>
>>>> On Dec 10, 2013, at 3:42 PM, tmishima_at_[hidden] wrote:
>>>>
>>>>>
>>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
>>>>>
>>>>> It stopped the error, but unfortunately "mapping by socket" itself didn't
>>>>> work well, as shown below:
>>>>>
>>>>> [mishima_at_manage demos]$ qsub -I -l nodes=1:ppn=32
>>>>> qsub: waiting for job 8260.manage.cluster to start
>>>>> qsub: job 8260.manage.cluster ready
>>>>>
>>>>> [mishima_at_node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>> [mishima_at_node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>> Hello world from process 2 of 8
>>>>> Hello world from process 1 of 8
>>>>> Hello world from process 3 of 8
>>>>> Hello world from process 0 of 8
>>>>> Hello world from process 6 of 8
>>>>> Hello world from process 5 of 8
>>>>> Hello world from process 4 of 8
>>>>> Hello world from process 7 of 8
>>>>>
>>>>> I think the result should look like this:
>>>>>
>>>>> rank 00
>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>> rank 01
>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>> rank 02
>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>> ...
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>
>>>>>> I fixed this under the trunk (it was an issue regardless of RM) and have
>>>>>> scheduled it for 1.7.4.
>>>>>>
>>>>>> Thanks!
>>>>>> Ralph
>>>>>>
>>>>>> On Nov 25, 2013, at 4:22 PM, tmishima_at_[hidden] wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Ralph,
>>>>>>>
>>>>>>> Thank you very much for your quick response.
>>>>>>>
>>>>>>> I'm afraid I found one more issue...
>>>>>>>
>>>>>>> It's not so serious. Please check it when you have time to spare.
>>>>>>>
>>>>>>> The problem is -cpus-per-proc combined with the -map-by option under the
>>>>>>> Torque manager. It doesn't work, as shown below. I guess you would get
>>>>>>> the same behaviour under the Slurm manager.
>>>>>>>
>>>>>>> Of course, if I remove the -map-by option, it works quite well.
>>>>>>>
>>>>>>> [mishima_at_manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>>>> qsub: job 8116.manage.cluster ready
>>>>>>>
>>>>>>> [mishima_at_node03 ~]$ cd ~/Ducom/testbed2
>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A request was made to bind to that would result in binding more
>>>>>>> processes than cpus on a resource:
>>>>>>>
>>>>>>> Bind to: CORE
>>>>>>> Node: node03
>>>>>>> #processes: 2
>>>>>>> #cpus: 1
>>>>>>>
>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>> option to your binding directive.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> [mishima_at_node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
>>>>>>>>
>>>>>>>> I'll update tomorrow.
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmishima_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Ralph,
>>>>>>>>
>>>>>>>> This is a continuation of the thread "Segmentation fault in oob_tcp.c of
>>>>>>>> openmpi-1.7.4a1r29646".
>>>>>>>>
>>>>>>>> I found the cause.
>>>>>>>>
>>>>>>>> First, I noticed that your hostfile works and mine does not.
>>>>>>>>
>>>>>>>> Your host file:
>>>>>>>> cat hosts
>>>>>>>> bend001 slots=12
>>>>>>>>
>>>>>>>> My host file:
>>>>>>>> cat hosts
>>>>>>>> node08
>>>>>>>> node08
>>>>>>>> ...(total 8 lines)
>>>>>>>>
>>>>>>>> I modified my script file to add "slots=1" to each line of my hostfile
>>>>>>>> just before launching mpirun. Then it worked.
>>>>>>>>
>>>>>>>> My host file(modified):
>>>>>>>> cat hosts
>>>>>>>> node08 slots=1
>>>>>>>> node08 slots=1
>>>>>>>> ...(total 8 lines)
>>>>>>>>
>>>>>>>> Secondly, I confirmed that there's a slight difference between
>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>>>>>
>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>>>> 394,401c394,399
>>>>>>>> < if (got_count) {
>>>>>>>> <     node->slots_given = true;
>>>>>>>> < } else if (got_max) {
>>>>>>>> <     node->slots = node->slots_max;
>>>>>>>> <     node->slots_given = true;
>>>>>>>> < } else {
>>>>>>>> <     /* should be set by obj_new, but just to be clear */
>>>>>>>> <     node->slots_given = false;
>>>>>>>> ---
>>>>>>>> > if (!got_count) {
>>>>>>>> >     if (got_max) {
>>>>>>>> >         node->slots = node->slots_max;
>>>>>>>> >     } else {
>>>>>>>> >         ++node->slots;
>>>>>>>> >     }
>>>>>>>> ....
>>>>>>>>
>>>>>>>> Finally, I added the line 402 below just as a tentative trial.
>>>>>>>> Then, it worked.
>>>>>>>>
>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>>> ...
>>>>>>>> 394 if (got_count) {
>>>>>>>> 395     node->slots_given = true;
>>>>>>>> 396 } else if (got_max) {
>>>>>>>> 397     node->slots = node->slots_max;
>>>>>>>> 398     node->slots_given = true;
>>>>>>>> 399 } else {
>>>>>>>> 400     /* should be set by obj_new, but just to be clear */
>>>>>>>> 401     node->slots_given = false;
>>>>>>>> 402     ++node->slots; /* added by tmishima */
>>>>>>>> 403 }
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Please fix the problem properly, because my change is just based on a
>>>>>>>> rough guess. It's related to the handling of hostfile entries where slots
>>>>>>>> information is not given.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
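
The slot counting at issue can be sketched as follows. This is a simplified Python model of the two branches in the diff above, not the real parser in orte/util/hostfile/hostfile.c. The function name `count_slots` and its flag are hypothetical, and the model assumes that explicit `slots=N` entries accumulate across repeated lines for the same node.

```python
def count_slots(lines, increment_when_unspecified):
    """Return {node: slots} for hostfile lines like 'node08' or 'node08 slots=1'.

    increment_when_unspecified=True models the 1.7.3 branch (++node->slots
    when no count is given); False models the unpatched 1.7.4a1r29646 branch.
    """
    slots = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        name = fields[0]
        given = None
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                given = int(value)
        slots.setdefault(name, 0)
        if given is not None:
            # Assumption: explicit counts add up over repeated lines.
            slots[name] += given
        elif increment_when_unspecified:
            slots[name] += 1
    return slots


if __name__ == "__main__":
    bare = ["node08"] * 8
    print(count_slots(bare, increment_when_unspecified=True))
    print(count_slots(bare, increment_when_unspecified=False))
    print(count_slots(["node08 slots=1"] * 8, increment_when_unspecified=False))
```

With eight bare `node08` lines, the 1.7.3-style rule yields 8 slots while the 1.7.4a1r29646 rule leaves the node at 0, which is consistent with the hostfile only working once "slots=1" was added to each line.
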
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users