Subject: Re: [OMPI users] hostfile issue of openmpi-1.7.4rc2
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-18 11:09:18


I believe I now have this working correctly on the trunk and set up for 1.7.4. If you get a chance, please give it a try and confirm that it solves the problem.

Thanks
Ralph

On Jan 17, 2014, at 2:16 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Sorry for the delay - I understood and was just occupied with something else for a while. Thanks for the follow-up. I'm looking at the issue and trying to decipher the right solution.
>
>
> On Jan 17, 2014, at 2:00 PM, tmishima_at_[hidden] wrote:
>
>>
>>
>> Hi Ralph,
>>
>> I'm sorry that my explanation was not clear enough.
>> This is a summary of my situation:
>>
>> 1. I create a hostfile as shown below manually.
>>
>> 2. I use mpirun to start the job without Torque, which means I'm running
>> in an unmanaged environment.
>>
>> 3. First, ORTE detects 8 slots on each host (maybe in
>> "orte_ras_base_allocate"; see the sketch after this summary):
>> node05: slots=8 max_slots=0 slots_inuse=0
>> node06: slots=8 max_slots=0 slots_inuse=0
>>
>> 4. Then, the code I identified resets the slot counts:
>> node05: slots=1 max_slots=0 slots_inuse=0
>> node06: slots=1 max_slots=0 slots_inuse=0
>>
>> 5. Therefore, ORTE believes that there is only one slot on each host.
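>>
>> As a rough illustration of step 3, here is a minimal standalone sketch
>> (simplified stand-in types, not the actual hostfile.c code) of how a node
>> that is listed repeatedly, with no explicit slots=N, accumulates one slot
>> per line:
>>
>> #include <stdio.h>
>> #include <string.h>
>>
>> struct fake_node { char name[64]; int slots; int slots_given; };
>>
>> int main(void)
>> {
>>     /* same hostfile as shown at the bottom: node05 x 8, node06 x 8 */
>>     const char *lines[16];
>>     for (int i = 0; i < 16; i++) lines[i] = (i < 8) ? "node05" : "node06";
>>
>>     struct fake_node nodes[4];
>>     int nnodes = 0;
>>     for (int i = 0; i < 16; i++) {
>>         int j;
>>         for (j = 0; j < nnodes; j++) {
>>             if (0 == strcmp(nodes[j].name, lines[i])) {
>>                 nodes[j].slots++;           /* repeated entry -> one more slot */
>>                 break;
>>             }
>>         }
>>         if (j == nnodes) {                  /* first time this node is seen */
>>             memset(&nodes[nnodes], 0, sizeof(nodes[nnodes]));
>>             strncpy(nodes[nnodes].name, lines[i], sizeof(nodes[nnodes].name) - 1);
>>             nodes[nnodes].slots = 1;
>>             nodes[nnodes].slots_given = 0;  /* no explicit slots=N in the file */
>>             nnodes++;
>>         }
>>     }
>>     for (int i = 0; i < nnodes; i++)
>>         printf("%s: slots=%d slots_given=%d\n",
>>                nodes[i].name, nodes[i].slots, nodes[i].slots_given);
>>     /* prints slots=8 for node05 and node06, matching step 3 above */
>>     return 0;
>> }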
>>
>> Regards,
>> Tetsuya Mishima
>>
>>> No, I didn't use Torque this time.
>>>
>>> This issue occurs only when the job runs outside a managed
>>> environment - namely, when orte_managed_allocation is false
>>> (and orte_set_slots is NULL).
>>>
>>> Under Torque management, it works fine.
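>>>
>>> Roughly, the decision I mean looks like this (a simplified sketch with
>>> stand-in variables, not the exact control flow in the ORTE source):
>>>
>>> #include <stdbool.h>
>>> #include <stdio.h>
>>>
>>> int main(void)
>>> {
>>>     bool orte_managed_allocation = false;  /* no Torque this time */
>>>     const char *orte_set_slots = NULL;     /* no explicit slot policy set */
>>>     int hostfile_slots = 8;                /* counted from the hostfile */
>>>     int slots;
>>>
>>>     if (orte_managed_allocation || NULL != orte_set_slots) {
>>>         slots = hostfile_slots;            /* managed case: counts survive */
>>>     } else {
>>>         slots = 1;                         /* unmanaged case: reset to 1 */
>>>     }
>>>     printf("slots = %d\n", slots);         /* prints 1 for this run */
>>>     return 0;
>>> }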
>>>
>>> I hope you can understand the situation.
>>>
>>> Tetsuya Mishima
>>>
>>>> I'm sorry, but I'm really confused, so let me try to understand the
>>>> situation.
>>>>
>>>> You use Torque to get an allocation, so you are running in a managed
>>>> environment.
>>>>
>>>> You then use mpirun to start the job, but pass it a hostfile as shown
>>>> below.
>>>>
>>>> Somehow, ORTE believes that there is only one slot on each host, and you
>>>> believe the code you've identified is resetting the slot counts.
>>>>
>>>> Is that a correct summary of the situation?
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Jan 16, 2014, at 4:00 PM, tmishima_at_[hidden] wrote:
>>>>
>>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> I encountered the hostfile issue again, where slots are counted by
>>>>> listing a node multiple times. This was supposed to be fixed by r29765
>>>>> - "Fix hostfile parsing for the case where RMs count slots ...".
>>>>>
>>>>> The difference is whether an RM is used or not. At that time, I executed
>>>>> mpirun through the Torque manager. This time I executed it directly from
>>>>> the command line as shown at the bottom, where node05 and node06 each
>>>>> have 8 cores.
>>>>>
>>>>> Then I checked the source files around it and found that lines 151-160 in
>>>>> plm_base_launch_support.c cause this issue. Since node->slots is already
>>>>> counted in hostfile.c @ r29765 even when node->slots_given is false,
>>>>> I think this part of plm_base_launch_support.c is unnecessary.
>>>>>
>>>>> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
>>>>> 151     } else {
>>>>> 152         /* set any non-specified slot counts to 1 */
>>>>> 153         for (i=0; i < orte_node_pool->size; i++) {
>>>>> 154             if (NULL == (node =
>>>>>                     (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>>>>> 155                 continue;
>>>>> 156             }
>>>>> 157             if (!node->slots_given) {
>>>>> 158                 node->slots = 1;
>>>>> 159             }
>>>>> 160         }
>>>>> 161     }
>>>>>
>>>>> With this part removed, it works very well, and the function of
>>>>> orte_set_default_slots still works. I think this would be a more
>>>>> compatible extension of openmpi-1.7.3.
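>>>>>
>>>>> For illustration, here is a small standalone stand-in (fake types, not
>>>>> the real orte_node_t or opal_pointer_array) showing what the loop quoted
>>>>> above does to hostfile-derived counts when slots_given was never set:
>>>>>
>>>>> #include <stdbool.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> struct fake_node { const char *name; int slots; bool slots_given; };
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>     /* state after hostfile parsing: 8 slots counted, slots_given false */
>>>>>     struct fake_node pool[] = {
>>>>>         { "node05", 8, false },
>>>>>         { "node06", 8, false },
>>>>>     };
>>>>>     int n = (int)(sizeof(pool) / sizeof(pool[0]));
>>>>>
>>>>>     /* equivalent of lines 151-160: default "non-specified" counts to 1 */
>>>>>     for (int i = 0; i < n; i++) {
>>>>>         if (!pool[i].slots_given) {
>>>>>             pool[i].slots = 1;
>>>>>         }
>>>>>     }
>>>>>
>>>>>     for (int i = 0; i < n; i++)
>>>>>         printf("%s: slots=%d\n", pool[i].name, pool[i].slots);
>>>>>     /* prints slots=1 for both nodes, i.e. the reset described above */
>>>>>     return 0;
>>>>> }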
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>
>>>>> [mishima_at_manage work]$ cat pbs_hosts
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
>>>>> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket
>>>>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]:
>>>>> [./././.][B/B/B/B]
>>>>> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available
>>>>> processors)
>>>>> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>>>> [B/B/B/B][./././.]
>>>>> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available
>>>>> processors)
>>>>> Hello world from process 0 of 4
>>>>> Hello world from process 1 of 4
>>>>> Hello world from process 3 of 4
>>>>> Hello world from process 2 of 4