Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-06-06 21:25:35


Hmmm.... Tetsuya is quite correct. I'm afraid I got distracted by the segfault (still investigating that one). Our default policy for 2 processes is to map-by core, and that would indeed fail when cpus-per-proc > 1. However, that seems like a non-intuitive requirement, so let me see if I can make this a little more user-friendly.
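In the meantime, Tetsuya's suggestion is the right workaround. As a quick sketch (untested on your system; --report-bindings is only added so you can see where each rank actually lands):

  $ mpirun -np 2 --map-by socket:pe=8 --report-bindings ./hello

With 2 ranks and pe=8 on these 2-socket, 8-core-per-socket nodes, each rank should come out bound to the 8 cores of its own socket (roughly cores 0-7 for rank 0 and 8-15 for rank 1, matching what the 1.6.3 -cpus-per-rank run gave). "--map-by slot:pe=8" should behave the same way here, since the slot object has no fixed cpu count for pe=8 to exceed.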

On Jun 6, 2014, at 2:25 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Dan,
>
> Please try:
> mpirun -np 2 --map-by socket:pe=8 ./hello
> or
> mpirun -np 2 --map-by slot:pe=8 ./hello
>
> You cannot bind 8 cpus to the object "core", which has
> only one cpu. This limitation started with the 1.8 series.
>
> The object "socket" has 8 cores in your case, so that
> works. The object "slot" is almost the same as "core",
> but it can exceed the limit because it is a fictitious
> object with no size.
>
> Regards,
> Tetsuya Mishima
>
>
>> No problem -
>>
>> These are Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips.
>> 2 per node, 8 cores each. No threading enabled.
>>
>> $ lstopo
>> Machine (64GB)
>> NUMANode L#0 (P#0 32GB)
>> Socket L#0 + L3 L#0 (20MB)
>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>> HostBridge L#0
>> PCIBridge
>> PCI 1000:0087
>> Block L#0 "sda"
>> PCIBridge
>> PCI 8086:2250
>> PCIBridge
>> PCI 8086:1521
>> Net L#1 "eth0"
>> PCI 8086:1521
>> Net L#2 "eth1"
>> PCIBridge
>> PCI 102b:0533
>> PCI 8086:1d02
>> NUMANode L#1 (P#1 32GB)
>> Socket L#1 + L3 L#1 (20MB)
>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>> HostBridge L#5
>> PCIBridge
>> PCI 15b3:1011
>> Net L#3 "ib0"
>> OpenFabrics L#4 "mlx5_0"
>> PCIBridge
>> PCI 8086:2250
>>
>> The segfault output is below. I tried reproducing the crash on an allocation
>> smaller than 4 nodes but wasn't able to.
>>
>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ mpirun -np 2
>> -machinefile ./nodes -mca plm_base_verbose 10 ./hello
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> registering plm components
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> found loaded component isolated
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> component isolated has no register or open function
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> found loaded component slurm
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> component slurm register function successful
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> found loaded component rsh
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> component rsh register function successful
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> found loaded component tm
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>> component tm register function successful
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: opening
>> plm components
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>> loaded component isolated
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>> component isolated open function successful
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>> loaded component slurm
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>> component slurm open function successful
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>> loaded component rsh
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>> component rsh open function successful
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>> loaded component tm
>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>> component tm open function successful
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select: Auto-selecting plm
>> components
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>> component [isolated]
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of
>> component [isolated] set priority to 0
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>> component [slurm]
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Skipping
>> component [slurm]. Query failed to return a module
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>> component [rsh]
>> [conte-a009.rcac.purdue.edu:55685] [[INVALID],INVALID] plm:rsh_lookup
>> on agent ssh : rsh path NULL
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of
>> component [rsh] set priority to 10
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>> component [tm]
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of
>> component [tm] set priority to 75
>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Selected
>> component [tm]
>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component isolated closed
>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading component isolated
>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component slurm closed
>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading component slurm
>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component rsh closed
>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading component rsh
>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: initial bias 55685 nodename hash 3965217721
>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: final jobfam 24164
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:receive start comm
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_job
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm creating map
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm add
>> new daemon [[24164,0],1]
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm
>> assigning new daemon [[24164,0],1] to node conte-a055
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching vm
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: final top-level argv:
>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" -mca plm_base_verbose 10
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting LD_LIBRARY_PATH: /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/usr/pbs/lib:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mpirt/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/ipp/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/tbb/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/lib64
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting PATH: /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/bin/intel64:/opt/intel/mic/bin:/apps/rhel6/intel/inspector_xe_2013/bin64:/apps/rhel6/intel/advisor_xe_2013/bin64:/apps/rhel6/intel/vtune_amplifier_xe_2013/bin64:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/bin:/usr/lib64/qt-3.3/bin:/opt/moab/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/hpss/bin:/opt/hsi/bin:/opt/ibutils/bin:/usr/pbs/bin:/opt/moab/bin:/usr/site/rcac/scripts:/usr/site/rcac/support_scripts:/usr/site/rcac/bin:/usr/site/rcac/sbin:/usr/sbin
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching on
>> node conte-a055
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: executing:
>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" -mca plm_base_verbose 10
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm:launch:
>> finished spawning orteds
>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register:
>> registering plm components
>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register:
>> found loaded component rsh
>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register:
>> component rsh register function successful
>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: opening
>> plm components
>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: found
>> loaded component rsh
>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open:
>> component rsh open function successful
>> [conte-a055.rcac.purdue.edu:32094] mca:base:select: Auto-selecting plm
>> components
>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Querying
>> component [rsh]
>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_lookup on
>> agent ssh : rsh path NULL
>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Query of
>> component [rsh] set priority to 10
>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Selected
>> component [rsh]
>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_setup on
>> agent ssh : rsh path NULL
>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive start comm
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0]
>> plm:base:orted_report_launch from daemon [[24164,0],1]
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0]
>> plm:base:orted_report_launch from daemon [[24164,0],1] on node
>> conte-a055
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] RECEIVED TOPOLOGY
>> FROM NODE conte-a055
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] NEW TOPOLOGY - ADDING
>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:orted_report_launch completed for daemon [[24164,0],1] at contact 1583611904.1;tcp://172.18.96.95,172.31.1.254,172.31.2.254,172.18.112.95:58312
>> [conte-a009:55685] *** Process received signal ***
>> [conte-a009:55685] Signal: Segmentation fault (11)
>> [conte-a009:55685] Signal code: Address not mapped (1)
>> [conte-a009:55685] Failing at address: 0x4c
>> [conte-a009:55685] [ 0] /lib64/libpthread.so.0[0x327f80f500]
>> [conte-a009:55685] [ 1] /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x951)[0x2b5b069a50e1]
>> [conte-a009:55685] [ 2] /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b5b075ff145]
>> [conte-a009:55685] [ 3] mpirun(orterun+0x1ffd)[0x4073b5]
>> [conte-a009:55685] [ 4] mpirun(main+0x20)[0x4048f4]
>> [conte-a009:55685] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x327f41ecdd]
>> [conte-a009:55685] [ 6] mpirun[0x404819]
>> [conte-a009:55685] *** End of error message ***
>> Segmentation fault (core dumped)
>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$
>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive stop
>> comm
>>
>> On Fri, Jun 6, 2014 at 3:00 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> Sorry to pester with questions, but I'm trying to narrow down the
> issue.
>>>
>>> * What kind of chips are on these machines?
>>>
>>> * If they have h/w threads, are they enabled?
>>>
>>> * you might have lstopo on one of those machines - could you pass along
> its output? Otherwise, you can run a simple "mpirun -n 1 -mca
> ess_base_verbose 20 hostname" and it will print out. Only need
>> one node in your allocation as we don't need a fountain of output.
>>>
>>> I'll look into the segfault - hard to understand offhand, but could be
> an uninitialized variable. If you have a chance, could you rerun that test
> with "-mca plm_base_verbose 10" on the cmd line?
>>>
>>> Thanks again
>>> Ralph
>>>
>>> On Jun 6, 2014, at 10:31 AM, Dan Dietz <ddietz_at_[hidden]> wrote:
>>>
>>>> Thanks for the reply. I tried out the --display-allocation option with
>>>> several different combinations and have attached the output. I see
>>>> this behavior on RHEL6.4, RHEL6.5, and RHEL5.10 clusters.
>>>>
>>>>
>>>> Here's debugging info on the segfault. Does that help? FWIW this does
>>>> not seem to crash on the RHEL5 cluster or RHEL6.5 cluster. Just
>>>> crashes on RHEL6.4.
>>>>
>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ gdb -c core.22623
>>>> `which mpirun`
>>>> No symbol table is loaded. Use the "file" command.
>>>> GNU gdb (GDB) 7.5-1.3.187
>>>> Copyright (C) 2012 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
>>>> This is free software: you are free to change and redistribute it.
>>>> There is NO WARRANTY, to the extent permitted by law. Type "show
> copying"
>>>> and "show warranty" for details.
>>>> This GDB was configured as "x86_64-unknown-linux-gnu".
>>>> For bug reporting instructions, please see:
>>>> <http://www.gnu.org/software/gdb/bugs/>...
>>>> Reading symbols from
>>
>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin/mpirun...done.
>
>>>> [New LWP 22623]
>>>> [New LWP 22624]
>>>>
>>>> warning: Can't read pathname for load map: Input/output error.
>>>> [Thread debugging using libthread_db enabled]
>>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>> Core was generated by `mpirun -np 2 -machinefile ./nodes ./hello'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1,
>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422
>>>> 422 node->hostid = node->daemon->name.vpid;
>>>> (gdb) bt
>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1,
>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422
>>>> #1 0x00002acc60eec145 in opal_libevent2021_event_base_loop () from
>>
>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6
>
>>>> #2 0x00000000004073b5 in orterun (argc=6, argv=0x7fff5bb2a3a8) at
>>>> orterun.c:1077
>>>> #3 0x00000000004048f4 in main (argc=6, argv=0x7fff5bb2a3a8) at
> main.c:13
>>>>
>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ cat nodes
>>>> conte-a009
>>>> conte-a009
>>>> conte-a055
>>>> conte-a055
>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ uname -r
>>>> 2.6.32-358.14.1.el6.x86_64
>>>>
>>>> On Thu, Jun 5, 2014 at 7:54 PM, Ralph Castain <rhc_at_[hidden]>
> wrote:
>>>>>
>>>>> On Jun 5, 2014, at 2:13 PM, Dan Dietz <ddietz_at_[hidden]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP
>>>>>> codes. In OMPI 1.6.3, I can do:
>>>>>>
>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
>>>>>>
>>>>>> I get one rank bound to procs 0-7 and the other bound to 8-15.
> Great!
>>>>>>
>>>>>> But I'm having some difficulties doing this with openmpi 1.8.1:
>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
>>>>>>
> --------------------------------------------------------------------------
>>>>>> The following command line options and corresponding MCA parameter
> have
>>>>>> been deprecated and replaced as follows:
>>>>>>
>>>>>> Command line options:
>>>>>> Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank,
>>>>>> -cpus-per-rank
>>>>>> Replacement: --map-by <obj>:PE=N
>>>>>>
>>>>>> Equivalent MCA parameter:
>>>>>> Deprecated: rmaps_base_cpus_per_proc
>>>>>> Replacement: rmaps_base_mapping_policy=<obj>:PE=N
>>>>>>
>>>>>> The deprecated forms *will* disappear in a future version of Open
> MPI.
>>>>>> Please update to the new syntax.
>>>>>>
> --------------------------------------------------------------------------
>>>>>>
> --------------------------------------------------------------------------
>>>>>> There are not enough slots available in the system to satisfy the 2
> slots
>>>>>> that were requested by the application:
>>>>>> ./hello
>>>>>>
>>>>>> Either request fewer slots for your application, or make more slots
> available
>>>>>> for use.
>>>>>>
> --------------------------------------------------------------------------
>>>>>>
>>>>>> OK, let me try the new syntax...
>>>>>>
>>>>>> $ mpirun -np 2 --map-by core:pe=8 -machinefile ./nodes ./hello
>>>>>>
> --------------------------------------------------------------------------
>>>>>> There are not enough slots available in the system to satisfy the 2
> slots
>>>>>> that were requested by the application:
>>>>>> ./hello
>>>>>>
>>>>>> Either request fewer slots for your application, or make more slots
> available
>>>>>> for use.
>>>>>>
> --------------------------------------------------------------------------
>>>>>>
>>>>>> What am I doing wrong? The documentation on these new options is
>>>>>> somewhat poor and confusing so I'm probably doing something wrong.
> If
>>>>>> anyone could provide some pointers here it'd be much appreciated! If
>>>>>> it's not something simple and you need config logs and such please
> let
>>>>>> me know.
>>>>>
>>>>> Looks like we think there are less than 16 slots allocated on that
> node. What is in this "nodes" file? Without it, OMPI should read the Torque
> allocation directly. You might check what we think
>> the allocation is by adding --display-allocation to the cmd line
>>>>>
>>>>>>
>>>>>> As a side note -
>>>>>>
>>>>>> If I try this using the PBS nodefile with the above, I get a
> confusing message:
>>>>>>
>>>>>>
> --------------------------------------------------------------------------
>>>>>> A request for multiple cpus-per-proc was given, but a directive
>>>>>> was also give to map to an object level that has less cpus than
>>>>>> requested ones:
>>>>>>
>>>>>> #cpus-per-proc: 8
>>>>>> number of cpus: 1
>>>>>> map-by: BYCORE:NOOVERSUBSCRIBE
>>>>>>
>>>>>> Please specify a mapping level that has more cpus, or else let us
>>>>>> define a default mapping that will allow multiple cpus-per-proc.
>>>>>>
> --------------------------------------------------------------------------
>>>>>>
>>>>>> From what I've gathered this is because I have a node listed 16
> times
>>>>>> in my PBS nodefile so it's assuming then I have 1 core per node?
>>>>>
>>>>>
>>>>> No - if listed 16 times, it should compute 16 slots. Try adding
> --display-allocation to your cmd line and it should tell you how many slots
> are present.
>>>>>
>>>>> However, it doesn't assume there is a core for each slot. Instead, it
> detects the cores directly on the node. It sounds like it isn't seeing them
> for some reason. What OS are you running on that
>> node?
>>>>>
>>>>> FWIW: the 1.6 series has a different detection system for cores.
> Could be something is causing problems for the new one.
>>>>>
>>>>>> Some
>>>>>> better documentation here would be helpful. I haven't been able to
>>>>>> figure out how to use the "oversubscribe" option listed in the docs.
>>>>>> Not that I really want to oversubscribe, of course, I need to modify
>>>>>> the nodefile, but this just stumped me for a while as 1.6.3 didn't
>>>>>> have this behavior.
>>>>>>
>>>>>>
>>>>>> As an extra bonus, I get a segfault in this situation:
>>>>>>
>>>>>> $ mpirun -np 2 -machinefile ./nodes ./hello
>>>>>> [conte-a497:13255] *** Process received signal ***
>>>>>> [conte-a497:13255] Signal: Segmentation fault (11)
>>>>>> [conte-a497:13255] Signal code: Address not mapped (1)
>>>>>> [conte-a497:13255] Failing at address: 0x2c
>>>>>> [conte-a497:13255] [ 0] /lib64/libpthread.so.0[0x3c9460f500]
>>>>>> [conte-a497:13255] [ 1]
>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7
> (orte_plm_base_complete_setup+0x615)[0x2ba960a59015]
>>>>>> [conte-a497:13255] [ 2]
>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6
> (opal_libevent2021_event_base_loop+0xa05)[0x2ba961666715]
>>>>>> [conte-a497:13255] [ 3] mpirun(orterun+0x1b45)[0x40684f]
>>>>>> [conte-a497:13255] [ 4] mpirun(main+0x20)[0x4047f4]
>>>>>> [conte-a497:13255] [ 5] /lib64/libc.so.6(__libc_start_main
> +0xfd)[0x3a1bc1ecdd]
>>>>>> [conte-a497:13255] [ 6] mpirun[0x404719]
>>>>>> [conte-a497:13255] *** End of error message ***
>>>>>> Segmentation fault (core dumped)
>>>>>>
>>>>>
>>>>> Huh - that's odd. Could you perhaps configure OMPI with
> --enable-debug and gdb the core file to tell us the line numbers involved?
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>>
>>>>>> My "nodes" file simply contains the first two lines of my original
>>>>>> $PBS_NODEFILE provided by Torque. See above why I modified. Works
> fine
>>>>>> if use the full file.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks in advance for any pointers you all may have!
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dan Dietz
>>>>>> Scientific Applications Analyst
>>>>>> ITaP Research Computing, Purdue University
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Dan Dietz
>>>> Scientific Applications Analyst
>>>> ITaP Research Computing, Purdue University
>>>
>>
>>
>>
>> --
>> Dan Dietz
>> Scientific Applications Analyst
>> ITaP Research Computing, Purdue University
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users