
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] hostfile issue of openmpi-1.7.4rc2
From: tmishima_at_[hidden]
Date: 2014-01-19 04:36:44


Thank you for your fix. I will try it tomorrow.

Before that, although I could not understand everything,
let me ask some questions about the new hostfile.c.

1. Lines 244-248 are inside the else-clause, which it seems to me might
cause a memory leak. Should they be moved out of the clause?

244     if (NULL != node_alias) {
245         /* add to list of aliases for this node - only add if unique */
246         opal_argv_append_unique_nosize(&node->alias, node_alias, false);
247         free(node_alias);
248     }

2. For the same reason, should lines 306-314 be moved out of the else-clause?

3. I think that node->slots_given of hosts detected through a rank-file
should always be true, to avoid an override by orte_set_default_slots.
Should line 305 be moved out of the else-clause as well?

305         node->slots_given = true;

Regards,
Tetsuya Mishima

> I believe I now have this working correctly on the trunk and set up for
> 1.7.4. If you get a chance, please give it a try and confirm it solves
> the problem.
>
> Thanks
> Ralph
>
> On Jan 17, 2014, at 2:16 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> > Sorry for the delay - I understood and was just occupied with something
> > else for a while. Thanks for the follow-up. I'm looking at the issue
> > and trying to decipher the right solution.
> >
> >
> > On Jan 17, 2014, at 2:00 PM, tmishima_at_[hidden] wrote:
> >
> >>
> >>
> >> Hi Ralph,
> >>
> >> I'm sorry that my explanation was not enough ...
> >> This is the summary of my situation:
> >>
> >> 1. I create a hostfile as shown below manually.
> >>
> >> 2. I use mpirun to start the job without Torque, which means I'm
> >> running in an un-managed environment.
> >>
> >> 3. Firstly, ORTE detects 8 slots on each host (maybe in
> >> "orte_ras_base_allocate").
> >> node05: slots=8 max_slots=0 slots_inuse=0
> >> node06: slots=8 max_slots=0 slots_inuse=0
> >>
> >> 4. Then, the code I identified is resetting the slot counts.
> >> node05: slots=1 max_slots=0 slots_inuse=0
> >> node06: slots=1 max_slots=0 slots_inuse=0
> >>
> >> 5. Therefore, ORTE believes that there is only one slot on each host.
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >>> No, I didn't use Torque this time.
> >>>
> >>> This issue is caused only when it is not in the managed
> >>> environment - namely, orte_managed_allocation is false
> >>> (and orte_set_slots is NULL).
> >>>
> >>> Under the torque management, it works fine.
> >>>
> >>> I hope you can understand the situation.
> >>>
> >>> Tetsuya Mishima
> >>>
> >>>> I'm sorry, but I'm really confused, so let me try to understand
> >>>> the situation.
> >>>>
> >>>> You use Torque to get an allocation, so you are running in a managed
> >>> environment.
> >>>>
> >>>> You then use mpirun to start the job, but pass it a hostfile as
> >>>> shown below.
> >>>>
> >>>> Somehow, ORTE believes that there is only one slot on each host,
> >>>> and you believe the code you've identified is resetting the slot
> >>>> counts.
> >>>>
> >>>> Is that a correct summary of the situation?
> >>>>
> >>>> Thanks
> >>>> Ralph
> >>>>
> >>>> On Jan 16, 2014, at 4:00 PM, tmishima_at_[hidden] wrote:
> >>>>
> >>>>>
> >>>>> Hi Ralph,
> >>>>>
> >>>>> I encountered the hostfile issue again where slots are counted by
> >>>>> listing the node multiple times. This should be fixed by r29765
> >>>>> - Fix hostfile parsing for the case where RMs count slots ....
> >>>>>
> >>>>> The difference is using an RM or not. At that time, I executed
> >>>>> mpirun through the Torque manager. This time I executed it directly
> >>>>> from the command line as shown at the bottom, where node05 and
> >>>>> node06 have 8 cores each.
> >>>>>
> >>>>> Then, I checked the source files around it and found that lines
> >>>>> 151-160 in plm_base_launch_support.c caused this issue. As
> >>>>> node->slots is already counted in hostfile.c @ r29765 even when
> >>>>> node->slots_given is false, I think this part of
> >>>>> plm_base_launch_support.c would be unnecessary.
> >>>>>
> >>>>> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
> >>>>> 151     } else {
> >>>>> 152         /* set any non-specified slot counts to 1 */
> >>>>> 153         for (i=0; i < orte_node_pool->size; i++) {
> >>>>> 154             if (NULL == (node = (orte_node_t*)
> >>>>>                     opal_pointer_array_get_item(orte_node_pool, i))) {
> >>>>> 155                 continue;
> >>>>> 156             }
> >>>>> 157             if (!node->slots_given) {
> >>>>> 158                 node->slots = 1;
> >>>>> 159             }
> >>>>> 160         }
> >>>>> 161     }
> >>>>>
> >>>>> Removing this part, it works very well, and the function of
> >>>>> orte_set_default_slots is still alive. I think this would be
> >>>>> better for compatible extension of openmpi-1.7.3.
> >>>>>
> >>>>> Regards,
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>> [mishima_at_manage work]$ cat pbs_hosts
> >>>>> node05
> >>>>> node05
> >>>>> node05
> >>>>> node05
> >>>>> node05
> >>>>> node05
> >>>>> node05
> >>>>> node05
> >>>>> node06
> >>>>> node06
> >>>>> node06
> >>>>> node06
> >>>>> node06
> >>>>> node06
> >>>>> node06
> >>>>> node06
> >>>>> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts \
> >>>>>     -cpus-per-proc 4 -report-bindings myprog
> >>>>> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]],
> >>>>> socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]],
> >>>>> socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> >>>>> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all
> >>>>> available processors)
> >>>>> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]],
> >>>>> socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> >>>>> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all
> >>>>> available processors)
> >>>>> Hello world from process 0 of 4
> >>>>> Hello world from process 1 of 4
> >>>>> Hello world from process 3 of 4
> >>>>> Hello world from process 2 of 4
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> users_at_[hidden]
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>
> >>>
> >>
> >
>