Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] default num_procs of round_robin_mapper with cpus-per-proc option
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-25 11:47:04


Been tied up the last few days, but I did spend some time thinking about this some more - and I think I'm going to leave the current behavior as-is, adding a check that generates an error if you specify map-by core along with cpus-per-proc. My reasoning is that map-by core is a very specific directive - you are telling me to map each process to a specific core. If you then tell me to bind that process to multiple cpus, you create an inherent conflict that I don't readily know how to resolve.

IMO, the best solution is to generate an error and suggest you map-by slot instead. This frees me to bind as many cpus to that allocated slot as you care to specify, and removes the conflict.
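
For example, with your hostfile the supported form would be the one you already ran:

  mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot ~/mis/openmpi/demos/myprog

That maps each proc to a slot and binds it to four cores, balancing the procs across the sockets.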

HTH
Ralph

On Jan 22, 2014, at 9:37 PM, tmishima_at_[hidden] wrote:

>
>
> Thanks for your explanation, Ralph.
>
> But it's really subtle for me to understand ...
> Anyway, I'd like to report what I found through the verbose output.
>
> "-map-by core" calls "bind in place" as below:
> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> ...
> [manage.cluster:11362] mca:rmaps: compute bindings for job [8729,1] with policy CORE
> [manage.cluster:11362] mca:rmaps: bindings for job [8729,1] - core to core
> [manage.cluster:11362] mca:rmaps: bind in place for job [8729,1] with bindings CORE
> ...
>
> On the other hand, "-map-by slot" calls "bind downward" as below:
> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> ...
> [manage.cluster:12032] mca:rmaps: compute bindings for job [8571,1] with policy CORE
> [manage.cluster:12032] mca:rmaps: bind downward for job [8571,1] with bindings CORE
> ...
>
> I think your best guess is right and something is wrong with the
> bind_in_place function. I have to say the logic of the source code
> is so complex that I could not figure it out.
>
> Regards,
> Tetsuya Mishima
>
>> On Jan 22, 2014, at 8:08 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Thanks, Ralph.
>>>
>>> I have one more question. I'm sorry to ask you so many things ...
>>
>> Not a problem
>>
>>>
>>> Could you tell me the difference between "map-by slot" and "map-by core"?
>>> From my understanding, slot is a synonym of core.
>>
>> Not really - see below
>>
>>> But those behaviors using openmpi-1.7.4rc2 with the cpus-per-proc option
>>> are quite different, as shown below. I tried to browse the source code,
>>> but I have not been able to figure it out so far.
>>>
>>
>> It is a little subtle, I fear. When you tell us "map-by slot", we assign each process to an allocated slot without associating it with any specific cpu or core. When we then bind to core (as we do by default), we balance the binding across the sockets to improve performance.
>>
>> When you tell us "map-by core", we directly associate each process with a specific core. So when we bind, we bind you to that core. This will cause us to fully use all the cores on the first socket before we move to the next.
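>>
>> Concretely, on one of your 8-core nodes (two sockets of four cores each, if I'm reading your output right): with two procs on the node, "map-by slot" plus the default bind-to core puts rank 0 on socket 0 and rank 1 on socket 1, while "map-by core" puts rank 0 on core 0 and rank 1 on core 1 of the first socket.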
>>
>> I'm a little puzzled by your output, as it appears that cpus-per-proc was ignored, so that's something I'd have to look at more carefully. My best guess is that we aren't skipping cores to account for the cpus-per-proc setting, and thus the procs are being mapped to consecutive cores - which wouldn't be very good if we then bound them to multiple neighboring cores, as they'd fall on top of each other.
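>>
>> In other words, if we were skipping correctly, -cpus-per-proc 4 -map-by core should give rank 0 cores 0-3 and rank 1 cores 4-7, i.e. [B/B/B/B][./././.] and [./././.][B/B/B/B], instead of the single-core bindings in your output.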
>>
>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>> [un-managed environment] (node05 and node06 have 8 cores each)
>>>
>>> [mishima_at_manage work]$ cat pbs_hosts
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot ~/mis/openmpi/demos/myprog
>>> [node05.cluster:23949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node05.cluster:23949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node06.cluster:22139] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node06.cluster:22139] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> Hello world from process 0 of 4
>>> Hello world from process 1 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 2 of 4
>>> [mishima_at_manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core ~/mis/openmpi/demos/myprog
>>> [node05.cluster:23985] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>>> [node05.cluster:23985] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>>> [node06.cluster:22175] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>>> [node06.cluster:22175] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>>> Hello world from process 2 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 0 of 4
>>> Hello world from process 1 of 4
>>>
>>> (note) I see the same behavior in a managed environment under Torque
>>>
>>>> Seems like a reasonable, minimal-risk request - will do
>>>>
>>>> On Jan 22, 2014, at 4:28 PM, tmishima_at_[hidden] wrote:
>>>>
>>>>>
>>>>> Hi Ralph, I want to ask you one more thing about the default setting of
>>>>> num_procs when we don't specify the -np option and set cpus-per-proc > 1.
>>>>>
>>>>> In this case, the round_robin_mapper sets num_procs = num_slots, as below:
>>>>>
>>>>> rmaps_rr.c:
>>>>> 130    if (0 == app->num_procs) {
>>>>> 131        /* set the num_procs to equal the number of slots on these mapped nodes */
>>>>> 132        app->num_procs = num_slots;
>>>>> 133    }
>>>>>
>>>>> However, because cpus_per_rank > 1, this num_procs will be refused at
>>>>> line 61 of rmaps_rr_mappers.c, shown below, unless we switch on the
>>>>> oversubscribe directive.
>>>>>
>>>>> rmaps_rr_mappers.c:
>>>>> 61    if (num_slots < ((int)app->num_procs * orte_rmaps_base.cpus_per_rank)) {
>>>>> 62        if (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping)) {
>>>>> 63            orte_show_help("help-orte-rmaps-base.txt", "orte-rmaps-base:alloc-error",
>>>>> 64                           true, app->num_procs, app->app);
>>>>> 65            return ORTE_ERR_SILENT;
>>>>> 66        }
>>>>> 67    }
>>>>>
>>>>> Therefore, I think the default num_procs should be equal to num_slots
>>>>> divided by cpus_per_rank:
>>>>>
>>>>> app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
>>>>>
>>>>> This would be more convenient for most people who want to use the
>>>>> -cpus-per-proc option. I have already confirmed it works well. Please
>>>>> consider applying this fix to 1.7.4.
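>>>>>
>>>>> For example, with the 16-slot hostfile above and cpus_per_rank = 4, the
>>>>> current default sets num_procs = 16, which fails the check at line 61
>>>>> (16 slots < 16 * 4); the proposed default gives num_procs = 16 / 4 = 4,
>>>>> which fits exactly. A minimal sketch of the changed block in rmaps_rr.c
>>>>> (same context as quoted above):
>>>>>
>>>>> if (0 == app->num_procs) {
>>>>>     /* set num_procs to the number of slots on these mapped nodes,
>>>>>      * divided by cpus-per-rank, so the default proc count passes
>>>>>      * the no-oversubscribe check in rmaps_rr_mappers.c */
>>>>>     app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
>>>>> }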
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>