Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] default num_procs of round_robin_mapper with cpus-per-proc option
From: tmishima_at_[hidden]
Date: 2014-01-25 14:21:10


Hi Ralph, thank you for your comment.

I agree with your conclusion to leave the current behavior as it is.

As far as I have checked, the same behavior also occurs when
I bind to objects that contain fewer cores than cpus-per-proc,
e.g., l1cache, l2cache, and so on.

So, if it is easy to determine the number of cores contained
in the target object, it would be better to compare that number
with cpus-per-proc and generate an error with a suggestion in
such a situation.
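
A minimal sketch of such a check, in C. The function name
check_bind_object_size and its plain-integer arguments are hypothetical
and are not Open MPI code; the real mapper would first have to obtain the
core count of the binding object from the node topology, which is not
shown here:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of Open MPI.
 * Returns 0 when a binding object containing cores_in_object cores can
 * hold cpus_per_proc cpus, and -1 (after printing an error and a
 * suggestion to stderr) when the object is too small. */
int check_bind_object_size(int cores_in_object, int cpus_per_proc)
{
    assert(cores_in_object > 0 && cpus_per_proc > 0);

    if (cores_in_object < cpus_per_proc) {
        fprintf(stderr,
                "bind target contains %d core(s) but cpus-per-proc is %d; "
                "consider binding to a larger object\n",
                cores_in_object, cpus_per_proc);
        return -1;
    }
    return 0;
}
```

For example, an object with a single core and cpus-per-proc 4 would be
rejected, while an 8-core object would pass.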

Regards,
Tetsuya Mishima

> Been tied up the last few days, but I did spend some time thinking about
> this some more - and I think I'm going to leave the current behavior as-is,
> adding a check to see if you specify map-by core along with cpus-per-proc
> to generate an error in that situation. My reasoning is that map-by core is
> a very specific directive - you are telling me to map each process to a
> specific core. If you then tell me to bind that process to multiple cpus,
> you are creating an inherent conflict that I don't readily know how to
> resolve.
>
> IMO, the best solution is to generate an error and suggest you map-by slot
> instead. This frees me to bind as many cpus to that allocated slot as you
> care to specify, and removes the conflict.
>
> HTH
> Ralph
>
> On Jan 22, 2014, at 9:37 PM, tmishima_at_[hidden] wrote:
>
> >
> >
> > Thanks for your explanation, Ralph.
> >
> > But it is really subtle for me to understand ...
> > Anyway, I'd like to report what I found in the verbose output.
> >
> > "-map-by core" calls "bind in place" as below:
> > [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> > ...
> > [manage.cluster:11362] mca:rmaps: compute bindings for job [8729,1] with policy CORE
> > [manage.cluster:11362] mca:rmaps: bindings for job [8729,1] - core to core
> > [manage.cluster:11362] mca:rmaps: bind in place for job [8729,1] with bindings CORE
> > ...
> >
> > On the other hand, "-map-by slot" calls "bind downward" as below:
> > [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> > ...
> > [manage.cluster:12032] mca:rmaps: compute bindings for job [8571,1] with policy CORE
> > [manage.cluster:12032] mca:rmaps: bind downward for job [8571,1] with bindings CORE
> > ...
> >
> > I think your best guess is right and something is wrong with the
> > bind_in_place function. I have to say that the logic of the source
> > code is so complex that I could not figure it out.
> >
> > Regards,
> > Tetsuya Mishima
> >
> >> On Jan 22, 2014, at 8:08 PM, tmishima_at_[hidden] wrote:
> >>
> >>>
> >>>
> >>> Thanks, Ralph.
> >>>
> >>> I have one more question. I'm sorry to ask you many things ...
> >>
> >> Not a problem
> >>
> >>>
> >>> Could you tell me the difference between "map-by slot" and "map-by core"?
> >>> From my understanding, slot is a synonym for core.
> >>
> >> Not really - see below
> >>
> >>> But their behaviors with the cpus-per-proc option in openmpi-1.7.4rc2
> >>> are quite different, as shown below. I tried to browse the source code,
> >>> but I have not been able to figure it out so far.
> >>>
> >>
> >> It is a little subtle, I fear. When you tell us "map-by slot", we assign
> >> each process to an allocated slot without associating it to any specific
> >> cpu or core. When we then bind to core (as we do by default), we balance
> >> the binding across the sockets to improve performance.
> >>
> >> When you tell us "map-by core", then we directly associate each process
> >> with a specific core. So when we bind, we bind you to that core. This will
> >> cause us to fully use all the cores on the first socket before we move to
> >> the next.
> >>
> >> I'm a little puzzled by your output as it appears that cpus-per-proc was
> >> ignored, so that's something I'd have to look at more carefully. Best
> >> guess is that we aren't skipping cores to account for the cpus-per-proc
> >> setting, and thus the procs are being mapped to consecutive cores - which
> >> wouldn't be very good if we then bound them to multiple neighboring cores
> >> as they'd fall on top of each other.
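
The overlap described in the paragraph above can be illustrated with a
small C sketch. It is an illustration only and not Open MPI logic: the
function name bindings_overlap and the "stride" parameter are
hypothetical, and the sketch simply assumes that process i starts at core
i * stride and is bound to cpus_per_proc consecutive cores:

```c
#include <assert.h>

/* Illustration only, not Open MPI code.
 * Process i is assumed to cover cores
 *   [i * stride, i * stride + cpus_per_proc - 1].
 * Returns 1 if any two processes share a core, 0 otherwise. */
int bindings_overlap(int num_procs, int cpus_per_proc, int stride)
{
    assert(num_procs >= 0 && cpus_per_proc > 0 && stride > 0);

    /* Neighbouring processes are the closest pair, so it is enough to
     * check whether process i+1 starts before process i ends. */
    for (int i = 0; i + 1 < num_procs; i++) {
        int end_of_current = i * stride + cpus_per_proc - 1;
        int start_of_next = (i + 1) * stride;
        if (start_of_next <= end_of_current) {
            return 1;
        }
    }
    return 0;
}
```

With 4 processes and cpus-per-proc 4, mapping to consecutive cores
(stride 1) overlaps, whereas skipping ahead by cpus-per-proc (stride 4)
does not.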
> >>
> >>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>> [un-managed environment] (node05 and node06 have 8 cores each)
> >>>
> >>> [mishima_at_manage work]$ cat pbs_hosts
> >>> node05
> >>> node05
> >>> node05
> >>> node05
> >>> node05
> >>> node05
> >>> node05
> >>> node05
> >>> node06
> >>> node06
> >>> node06
> >>> node06
> >>> node06
> >>> node06
> >>> node06
> >>> node06
> >>> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot ~/mis/openmpi/demos/myprog
> >>> [node05.cluster:23949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> >>> [node05.cluster:23949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> >>> [node06.cluster:22139] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> >>> [node06.cluster:22139] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> >>> Hello world from process 0 of 4
> >>> Hello world from process 1 of 4
> >>> Hello world from process 3 of 4
> >>> Hello world from process 2 of 4
> >>> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core ~/mis/openmpi/demos/myprog
> >>> [node05.cluster:23985] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
> >>> [node05.cluster:23985] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
> >>> [node06.cluster:22175] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
> >>> [node06.cluster:22175] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
> >>> Hello world from process 2 of 4
> >>> Hello world from process 3 of 4
> >>> Hello world from process 0 of 4
> >>> Hello world from process 1 of 4
> >>>
> >>> (Note) I see the same behavior in an environment managed by Torque.
> >>>
> >>>> Seems like a reasonable, minimal risk request - will do
> >>>>
> >>>> On Jan 22, 2014, at 4:28 PM, tmishima_at_[hidden] wrote:
> >>>>
> >>>>>
> >>>>> Hi Ralph, I want to ask you one more thing, about the default setting
> >>>>> of num_procs when we don't specify the -np option and we set
> >>>>> cpus-per-proc > 1.
> >>>>>
> >>>>> In this case, the round_robin_mapper sets num_procs = num_slots as below:
> >>>>>
> >>>>> rmaps_rr.c:
> >>>>> 130     if (0 == app->num_procs) {
> >>>>> 131         /* set the num_procs to equal the number of slots on these mapped nodes */
> >>>>> 132         app->num_procs = num_slots;
> >>>>> 133     }
> >>>>>
> >>>>> However, because cpus_per_rank > 1, this num_procs will be rejected at
> >>>>> line 61 in rmaps_rr_mappers.c as below, unless we switch on the
> >>>>> oversubscribe directive.
> >>>>>
> >>>>> rmaps_rr_mappers.c:
> >>>>> 61     if (num_slots < ((int)app->num_procs * orte_rmaps_base.cpus_per_rank)) {
> >>>>> 62         if (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping)) {
> >>>>> 63             orte_show_help("help-orte-rmaps-base.txt", "orte-rmaps-base:alloc-error",
> >>>>> 64                            true, app->num_procs, app->app);
> >>>>> 65             return ORTE_ERR_SILENT;
> >>>>> 66         }
> >>>>> 67     }
> >>>>>
> >>>>> Therefore, I think the default num_procs should be equal to num_slots
> >>>>> divided by cpus_per_rank:
> >>>>>
> >>>>> app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
> >>>>>
> >>>>> This would be more convenient for most people who want to use the
> >>>>> -cpus-per-proc option. I have already confirmed that it works well.
> >>>>> Please consider applying this fix to 1.7.4.
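
The arithmetic behind this proposal can be written out as a small C
sketch. It is an illustration under stated assumptions rather than the
actual patch: plain integers stand in for app->num_procs, num_slots and
orte_rmaps_base.cpus_per_rank, and the three function names are
hypothetical:

```c
#include <assert.h>

/* Illustration only: plain integers stand in for app->num_procs,
 * num_slots and orte_rmaps_base.cpus_per_rank from the quoted snippets. */

/* Default as quoted from rmaps_rr.c: one process per slot. */
int default_num_procs_current(int num_slots)
{
    return num_slots;
}

/* Proposed default: one process per group of cpus_per_rank slots.
 * Integer division means leftover slots are simply not used. */
int default_num_procs_proposed(int num_slots, int cpus_per_rank)
{
    assert(cpus_per_rank > 0);
    return num_slots / cpus_per_rank;
}

/* Mirrors the condition quoted at line 61 of rmaps_rr_mappers.c:
 * nonzero means the request needs more slots than are available. */
int exceeds_slots(int num_slots, int num_procs, int cpus_per_rank)
{
    return num_slots < num_procs * cpus_per_rank;
}
```

With 16 slots (as in the pbs_hosts file shown in this thread) and
cpus-per-proc 4, the current default yields 16 processes, which need
16 * 4 = 64 slots and so fail the check unless oversubscription is
allowed. The proposed default yields 16 / 4 = 4 processes, which need
exactly 16 slots and pass.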
> >>>>>
> >>>>> Regards,
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> users_at_[hidden]
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users