
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] affinity issues under cpuset torque 1.8.1
From: Brock Palen (brockp_at_[hidden])
Date: 2014-06-25 15:39:55


Yes,

ompi_info --all

works.

ompi_info --param all all

[brockp_at_flux-login1 34241]$ ompi_info --param all all
Error getting SCIF driver version
                 MCA btl: parameter "btl_tcp_if_include" (current value: "",
                          data source: default, level: 1 user/basic, type:
                          string)
                          Comma-delimited list of devices and/or CIDR
                          notation of networks to use for MPI communication
                          (e.g., "eth0,192.168.0.0/16"). Mutually exclusive
                          with btl_tcp_if_exclude.
                 MCA btl: parameter "btl_tcp_if_exclude" (current value:
                          "127.0.0.1/8,sppp", data source: default, level: 1
                          user/basic, type: string)
                          Comma-delimited list of devices and/or CIDR
                          notation of networks to NOT use for MPI
                          communication -- all devices not matching these
                          specifications will be used (e.g.,
                          "eth0,192.168.0.0/16"). If set to a non-default
                          value, it is mutually exclusive with
                          btl_tcp_if_include.
[brockp_at_flux-login1 34241]$

ompi_info --param all all --level 9
(gives me what I expect).
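
For reference, a narrower query is usually easier to read than "all all". Assuming the 1.8.x option syntax shown above, something like this should work (a sketch; substitute whatever framework/component you care about):

    ompi_info --param btl tcp --level 9
    ompi_info --all --parsable | grep btl_tcp_if_include

The first shows every TCP BTL parameter across all nine levels; the second greps the full parsable dump for one specific setting.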

Thanks,

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp_at_[hidden]
(734)936-1985

On Jun 24, 2014, at 10:22 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> Brock --
>
> Can you run with "ompi_info --all"?
>
> With "--param all all", ompi_info in v1.8.x is defaulting to only showing level 1 MCA params. It's showing you all possible components and variables, but only level 1.
>
> Or you could also use "--level 9" to show all 9 levels. Here's the relevant section from the README:
>
> -----
> The following options may be helpful:
>
> --all                Show a *lot* of information about your Open MPI
>                      installation.
> --parsable           Display all the information in an easily
>                      grep/cut/awk/sed-able format.
> --param <framework> <component>
>                      A <framework> of "all" and a <component> of "all" will
>                      show all parameters to all components. Otherwise, the
>                      parameters of all the components in a specific framework,
>                      or just the parameters of a specific component can be
>                      displayed by using an appropriate <framework> and/or
>                      <component> name.
> --level <level>
>                      By default, ompi_info only shows "Level 1" MCA parameters
>                      -- parameters that can affect whether MPI processes can
>                      run successfully or not (e.g., determining which network
>                      interfaces to use). The --level option will display all
>                      MCA parameters from level 1 to <level> (the max <level>
>                      value is 9). Use "ompi_info --param <framework>
>                      <component> --level 9" to see *all* MCA parameters for a
>                      given component. See "The Modular Component Architecture
>                      (MCA)" section, below, for a fuller explanation.
> ----
>
>
>
>
> On Jun 24, 2014, at 5:19 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> That's odd - it shouldn't truncate the output. I'll take a look later today - we're all gathered for a developer's conference this week, so I'll be able to poke at this with Nathan.
>>
>>
>>
>> On Mon, Jun 23, 2014 at 3:15 PM, Brock Palen <brockp_at_[hidden]> wrote:
>> Perfection, flexible, extensible, so nice.
>>
>> BTW, this doesn't happen with older versions:
>>
>> [brockp_at_flux-login2 34241]$ ompi_info --param all all
>> Error getting SCIF driver version
>>                  MCA btl: parameter "btl_tcp_if_include" (current value: "",
>>                           data source: default, level: 1 user/basic, type:
>>                           string)
>>                           Comma-delimited list of devices and/or CIDR
>>                           notation of networks to use for MPI communication
>>                           (e.g., "eth0,192.168.0.0/16"). Mutually exclusive
>>                           with btl_tcp_if_exclude.
>>                  MCA btl: parameter "btl_tcp_if_exclude" (current value:
>>                           "127.0.0.1/8,sppp", data source: default, level: 1
>>                           user/basic, type: string)
>>                           Comma-delimited list of devices and/or CIDR
>>                           notation of networks to NOT use for MPI
>>                           communication -- all devices not matching these
>>                           specifications will be used (e.g.,
>>                           "eth0,192.168.0.0/16"). If set to a non-default
>>                           value, it is mutually exclusive with
>>                           btl_tcp_if_include.
>>
>>
>> This is normally much longer. And yes, we don't have the Xeon Phi (SCIF) stuff installed on all nodes; it's strange that 'all all' is now very short, though ompi_info -a still works.
>>
>>
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> brockp_at_[hidden]
>> (734)936-1985
>>
>>
>>
>> On Jun 20, 2014, at 1:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Put "orte_hetero_nodes=1" in your default MCA param file - uses can override by setting that param to 0
>>>
>>>
>>> On Jun 20, 2014, at 10:30 AM, Brock Palen <brockp_at_[hidden]> wrote:
>>>
>>>> Perfection! That appears to do it for our standard case.
>>>>
>>>> Now I know how to set MCA options by env var or config file. How can I make this the default, such that a user can then override it?
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> CAEN Advanced Computing
>>>> XSEDE Campus Champion
>>>> brockp_at_[hidden]
>>>> (734)936-1985
>>>>
>>>>
>>>>
>>>> On Jun 20, 2014, at 1:21 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>> I think I begin to grok at least part of the problem. If you are assigning different cpus on each node, then you'll need to tell us that by setting --hetero-nodes; otherwise we won't have any way to report that back to mpirun for its binding calculation.
>>>>>
>>>>> Otherwise, we expect that the cpuset of the first node we launch a daemon onto (or where mpirun is executing, if we are only launching local to mpirun) accurately represents the cpuset on every node in the allocation.
>>>>>
>>>>> We still might well have a bug in our binding computation - but the above will definitely impact what you said the user did.
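
For completeness, the command-line form of that flag would look something like this (a sketch, not verified on this cluster):

    mpirun --hetero-nodes --report-bindings -np 64 ./a.out

which should make mpirun gather the topology/cpuset from every node rather than assuming the first node is representative; setting orte_hetero_nodes=1 in an MCA param file or the environment is the equivalent persistent form.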
>>>>>
>>>>> On Jun 20, 2014, at 10:06 AM, Brock Palen <brockp_at_[hidden]> wrote:
>>>>>
>>>>>> Extra data point if I do:
>>>>>>
>>>>>> [brockp_at_nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
>>>>>> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>>
>>>>>> Bind to: CORE
>>>>>> Node: nyx5513
>>>>>> #processes: 2
>>>>>> #cpus: 1
>>>>>>
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> [brockp_at_nyx5508 34241]$ mpirun -H nyx5513 uptime
>>>>>> 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>>> 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38
>>>>>> [brockp_at_nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
>>>>>> 0x00000010
>>>>>> 0x00001000
>>>>>> [brockp_at_nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
>>>>>> nyx5513
>>>>>> nyx5513
>>>>>>
>>>>>> Interesting: if I force bind-to core, MPI barfs saying there is only 1 cpu available, even though PBS says it gave the job two; and if I run hwloc-bind --get on just that node (this is all inside an interactive job), I get what I expect.
>>>>>>
>>>>>> Is there a way to get a map of what MPI thinks it has on each host?
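
One way to get that kind of map (a sketch, assuming the 1.8 mpirun options):

    mpirun --display-allocation --display-map --report-bindings hostname

--display-allocation prints the nodes and slots mpirun believes the resource manager handed it, --display-map shows where each rank will be placed, and --report-bindings (as used above) shows the binding actually applied on each node.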
>>>>>>
>>>>>> Brock Palen
>>>>>> www.umich.edu/~brockp
>>>>>> CAEN Advanced Computing
>>>>>> XSEDE Campus Champion
>>>>>> brockp_at_[hidden]
>>>>>> (734)936-1985
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jun 20, 2014, at 12:38 PM, Brock Palen <brockp_at_[hidden]> wrote:
>>>>>>
>>>>>>> I was able to produce it in my test.
>>>>>>>
>>>>>>> orted affinity set by cpuset:
>>>>>>> [root_at_nyx5874 ~]# hwloc-bind --get --pid 103645
>>>>>>> 0x0000c002
>>>>>>>
>>>>>>> This mask (cores 1, 14, and 15), which spans sockets, matches the cpuset set up by the batch system:
>>>>>>> [root_at_nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
>>>>>>> 1,14-15
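
As a cross-check, the orted mask can be decoded the same way as the hwloc-calc example later in this thread; assuming the same logical/physical numbering as on these nodes, this should print the matching cores:

    hwloc-calc --intersect PU 0x0000c002

which would be expected to give 1,14,15, i.e. exactly the cpuset above.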
>>>>>>>
>>>>>>> The ranks, though, were then all bound to the same core:
>>>>>>>
>>>>>>> [root_at_nyx5874 ~]# hwloc-bind --get --pid 103871
>>>>>>> 0x00008000
>>>>>>> [root_at_nyx5874 ~]# hwloc-bind --get --pid 103872
>>>>>>> 0x00008000
>>>>>>> [root_at_nyx5874 ~]# hwloc-bind --get --pid 103873
>>>>>>> 0x00008000
>>>>>>>
>>>>>>> Which is core 15:
>>>>>>>
>>>>>>> report-bindings gave me:
>>>>>>> You can see how on a few nodes all ranks were bound to the same core, the last one in each case. I only gave you the hwloc-bind results for the host nyx5874.
>>>>>>>
>>>>>>> [nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all available processors)
>>>>>>> [nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all available processors)
>>>>>>> [nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all available processors)
>>>>>>> [nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all available processors)
>>>>>>> [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all available processors)
>>>>>>> [nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 59 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 60 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 56 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 57 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5798.engin.umich.edu:53026] MCW rank 58 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]
>>>>>>> [nyx5545.engin.umich.edu:88170] MCW rank 2 is not bound (or bound to all available processors)
>>>>>>> [nyx5613.engin.umich.edu:25229] MCW rank 31 is not bound (or bound to all available processors)
>>>>>>> [nyx5880.engin.umich.edu:01406] MCW rank 10 is not bound (or bound to all available processors)
>>>>>>> [nyx5770.engin.umich.edu:86538] MCW rank 6 is not bound (or bound to all available processors)
>>>>>>> [nyx5613.engin.umich.edu:25228] MCW rank 30 is not bound (or bound to all available processors)
>>>>>>> [nyx5577.engin.umich.edu:65949] MCW rank 4 is not bound (or bound to all available processors)
>>>>>>> [nyx5607.engin.umich.edu:30379] MCW rank 14 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72960] MCW rank 47 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72959] MCW rank 46 is not bound (or bound to all available processors)
>>>>>>> [nyx5848.engin.umich.edu:04332] MCW rank 33 is not bound (or bound to all available processors)
>>>>>>> [nyx5848.engin.umich.edu:04333] MCW rank 34 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72958] MCW rank 45 is not bound (or bound to all available processors)
>>>>>>> [nyx5858.engin.umich.edu:12165] MCW rank 35 is not bound (or bound to all available processors)
>>>>>>> [nyx5607.engin.umich.edu:30380] MCW rank 15 is not bound (or bound to all available processors)
>>>>>>> [nyx5544.engin.umich.edu:72957] MCW rank 44 is not bound (or bound to all available processors)
>>>>>>> [nyx5858.engin.umich.edu:12167] MCW rank 37 is not bound (or bound to all available processors)
>>>>>>> [nyx5870.engin.umich.edu:33811] MCW rank 7 is not bound (or bound to all available processors)
>>>>>>> [nyx5582.engin.umich.edu:81994] MCW rank 5 is not bound (or bound to all available processors)
>>>>>>> [nyx5848.engin.umich.edu:04331] MCW rank 32 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46654] MCW rank 50 is not bound (or bound to all available processors)
>>>>>>> [nyx5858.engin.umich.edu:12166] MCW rank 36 is not bound (or bound to all available processors)
>>>>>>> [nyx5799.engin.umich.edu:67802] MCW rank 22 is not bound (or bound to all available processors)
>>>>>>> [nyx5799.engin.umich.edu:67803] MCW rank 23 is not bound (or bound to all available processors)
>>>>>>> [nyx5556.engin.umich.edu:50889] MCW rank 3 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95931] MCW rank 53 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95930] MCW rank 52 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46655] MCW rank 51 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95932] MCW rank 54 is not bound (or bound to all available processors)
>>>>>>> [nyx5625.engin.umich.edu:95933] MCW rank 55 is not bound (or bound to all available processors)
>>>>>>> [nyx5866.engin.umich.edu:16306] MCW rank 40 is not bound (or bound to all available processors)
>>>>>>> [nyx5861.engin.umich.edu:22761] MCW rank 61 is not bound (or bound to all available processors)
>>>>>>> [nyx5861.engin.umich.edu:22762] MCW rank 62 is not bound (or bound to all available processors)
>>>>>>> [nyx5861.engin.umich.edu:22763] MCW rank 63 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46652] MCW rank 48 is not bound (or bound to all available processors)
>>>>>>> [nyx5557.engin.umich.edu:46653] MCW rank 49 is not bound (or bound to all available processors)
>>>>>>> [nyx5866.engin.umich.edu:16304] MCW rank 38 is not bound (or bound to all available processors)
>>>>>>> [nyx5788.engin.umich.edu:02465] MCW rank 20 is not bound (or bound to all available processors)
>>>>>>> [nyx5597.engin.umich.edu:68071] MCW rank 27 is not bound (or bound to all available processors)
>>>>>>> [nyx5775.engin.umich.edu:27952] MCW rank 17 is not bound (or bound to all available processors)
>>>>>>> [nyx5866.engin.umich.edu:16305] MCW rank 39 is not bound (or bound to all available processors)
>>>>>>> [nyx5788.engin.umich.edu:02466] MCW rank 21 is not bound (or bound to all available processors)
>>>>>>> [nyx5775.engin.umich.edu:27951] MCW rank 16 is not bound (or bound to all available processors)
>>>>>>> [nyx5597.engin.umich.edu:68073] MCW rank 29 is not bound (or bound to all available processors)
>>>>>>> [nyx5597.engin.umich.edu:68072] MCW rank 28 is not bound (or bound to all available processors)
>>>>>>> [nyx5552.engin.umich.edu:30481] MCW rank 12 is not bound (or bound to all available processors)
>>>>>>> [nyx5552.engin.umich.edu:30482] MCW rank 13 is not bound (or bound to all available processors)
>>>>>>>
>>>>>>>
>>>>>>> Brock Palen
>>>>>>> www.umich.edu/~brockp
>>>>>>> CAEN Advanced Computing
>>>>>>> XSEDE Campus Champion
>>>>>>> brockp_at_[hidden]
>>>>>>> (734)936-1985
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Jun 20, 2014, at 12:20 PM, Brock Palen <brockp_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Got it,
>>>>>>>>
>>>>>>>> I have the input from the user and am testing it out.
>>>>>>>>
>>>>>>>> It probably has less to do with Torque and more to do with cpusets.
>>>>>>>>
>>>>>>>> I'm working on producing it myself also.
>>>>>>>>
>>>>>>>> Brock Palen
>>>>>>>> www.umich.edu/~brockp
>>>>>>>> CAEN Advanced Computing
>>>>>>>> XSEDE Campus Champion
>>>>>>>> brockp_at_[hidden]
>>>>>>>> (734)936-1985
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jun 20, 2014, at 12:18 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> Thanks - I'm just trying to reproduce one problem case so I can look at it. Given that I don't have access to a Torque machine, I need to "fake" it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jun 20, 2014, at 9:15 AM, Brock Palen <brockp_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> In this case they are all on a single socket, but as you can see it could go either way depending on the job.
>>>>>>>>>>
>>>>>>>>>> Brock Palen
>>>>>>>>>> www.umich.edu/~brockp
>>>>>>>>>> CAEN Advanced Computing
>>>>>>>>>> XSEDE Campus Champion
>>>>>>>>>> brockp_at_[hidden]
>>>>>>>>>> (734)936-1985
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jun 19, 2014, at 2:44 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I should have been clearer - I was asking if cores 8-11 are all on one socket, or span multiple sockets
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jun 19, 2014, at 11:36 AM, Brock Palen <brockp_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>
>>>>>>>>>>>> It was a large job spread across many nodes. Our system allows users to ask for 'procs', which can be laid out across nodes in any arrangement.
>>>>>>>>>>>>
>>>>>>>>>>>> The list:
>>>>>>>>>>>>
>>>>>>>>>>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>>>>>>>>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>>>>>>>>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>>>>>>>>>>
>>>>>>>>>>>> Shows that nyx5406 had 2 cores, nyx5427 also 2, nyx5411 had 11.
>>>>>>>>>>>>
>>>>>>>>>>>> They could be spread across any number of socket configurations. We start very lax ("user requests X procs") and the user can then add stricter requirements from there. We support mostly serial users, and users can be co-located on nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> That is good to know; I think we would want to make 'bind to core' our default, except for our few users who run in hybrid mode.
>>>>>>>>>>>>
>>>>>>>>>>>> Our CPU set tells you what cores the job is assigned. So in the problem case provided, the cpuset/cgroup shows only cores 8-11 are available to this job on this node.
>>>>>>>>>>>>
>>>>>>>>>>>> Brock Palen
>>>>>>>>>>>> www.umich.edu/~brockp
>>>>>>>>>>>> CAEN Advanced Computing
>>>>>>>>>>>> XSEDE Campus Champion
>>>>>>>>>>>> brockp_at_[hidden]
>>>>>>>>>>>> (734)936-1985
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 18, 2014, at 11:10 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The default binding option depends on the number of procs - it is bind-to core for np=2, and bind-to socket for np > 2. You never said, but should I assume you ran 4 ranks? If so, then we should be trying to bind-to socket.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure what your cpuset is telling us - are you binding us to a socket? Are some cpus in one socket, and some in another?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It could be that the cpuset + bind-to socket is resulting in some odd behavior, but I'd need a little more info to narrow it down.
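
As an aside on the np-dependent default described above, the binding policy can also be made explicit so it no longer depends on the rank count, for example (a sketch):

    mpirun -np 4 --bind-to socket --report-bindings ./app
    mpirun -np 4 --bind-to core   --report-bindings ./app

--report-bindings then confirms which policy actually took effect inside the cpuset.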
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 18, 2014, at 7:48 PM, Brock Palen <brockp_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have started using 1.8.1 for some codes (meep in this case). It sometimes works fine, but in a few cases I am seeing ranks being given overlapping CPU assignments, though not always.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Example job, default binding options (so by-core right?):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Assigned nodes (the one in question is nyx5398); we use Torque cpusets and TM to spawn.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [nyx5406:2][nyx5427:2][nyx5506:2][nyx5311:3]
>>>>>>>>>>>>>> [nyx5329:4][nyx5398:4][nyx5396:11][nyx5397:11]
>>>>>>>>>>>>>> [nyx5409:11][nyx5411:11][nyx5412:3]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16065
>>>>>>>>>>>>>> 0x00000200
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16066
>>>>>>>>>>>>>> 0x00000800
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16067
>>>>>>>>>>>>>> 0x00000200
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16068
>>>>>>>>>>>>>> 0x00000800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# cat /dev/cpuset/torque/12703230.nyx.engin.umich.edu/cpus
>>>>>>>>>>>>>> 8-11
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So Torque claims the cpuset set up for the job has 4 cores, but as you can see the ranks were given identical bindings.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I checked the pids; they were part of the correct cpuset. I also checked orted:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# hwloc-bind --get --pid 16064
>>>>>>>>>>>>>> 0x00000f00
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# hwloc-calc --intersect PU 16064
>>>>>>>>>>>>>> ignored unrecognized argument 16064
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root_at_nyx5398 ~]# hwloc-calc --intersect PU 0x00000f00
>>>>>>>>>>>>>> 8,9,10,11
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which is exactly what I would expect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, umm, I'm lost as to why this might happen. What else should I check? Like I said, not all jobs show this behavior.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brock Palen
>>>>>>>>>>>>>> www.umich.edu/~brockp
>>>>>>>>>>>>>> CAEN Advanced Computing
>>>>>>>>>>>>>> XSEDE Campus Champion
>>>>>>>>>>>>>> brockp_at_[hidden]
>>>>>>>>>>>>>> (734)936-1985
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24672.php
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24673.php
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> users mailing list
>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24675.php
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24676.php
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users_at_[hidden]
>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24677.php
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users_at_[hidden]
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24678.php
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24681.php
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24682.php
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24683.php
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24684.php
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24690.php
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24694.php
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24696.php