
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
From: Dan Dietz (ddietz_at_[hidden])
Date: 2014-06-12 12:04:23


That shouldn't be a problem. Let me figure out the process and I'll
get back to you.

Dan

On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> Arggh - is there any way I can get access to this beast so I can debug this? I can't figure out what in the world is going on, but it seems to be something triggered by your specific setup.
>
>
> On Jun 12, 2014, at 8:48 AM, Dan Dietz <ddietz_at_[hidden]> wrote:
>
>> Unfortunately, the nightly tarball appears to be crashing in a similar
>> fashion. :-( I used the latest snapshot 1.8.2a1r31981.
>>
>> Dan
>>
>> On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> I've poked and prodded, and the 1.8.2 tarball seems to be handling this situation just fine. I don't have access to a Torque machine, but I did set everything to follow the same code path, added faux coprocessors, etc. - and it ran just fine.
>>>
>>> Can you try the 1.8.2 tarball and see if it solves the problem?
>>>
>>>
>>> On Jun 11, 2014, at 2:15 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> Okay, let me poke around some more. It is clearly tied to the coprocessors, but I'm not yet sure just why.
>>>>
>>>> One thing you might do is try the nightly 1.8.2 tarball - there have been a number of fixes, and this may well have been caught there. Worth taking a look.
>>>>
>>>>
>>>> On Jun 11, 2014, at 6:44 AM, Dan Dietz <ddietz_at_[hidden]> wrote:
>>>>
>>>>> Sorry - it crashes with both torque and rsh launchers. The output from
>>>>> a gdb backtrace on the core files looks identical.
>>>>>
>>>>> Dan
>>>>>
>>>>> On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>> Afraid I'm a little confused now - are you saying it works fine under Torque, but segfaults under rsh? Could you please clarify your current situation?
>>>>>>
>>>>>>
>>>>>> On Jun 11, 2014, at 6:27 AM, Dan Dietz <ddietz_at_[hidden]> wrote:
>>>>>>
>>>>>>> It looks like it is still segfaulting with the rsh launcher:
>>>>>>>
>>>>>>> ddietz_at_conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh
>>>>>>> -np 4 -machinefile ./nodes ./hello
>>>>>>> [conte-a084:51113] *** Process received signal ***
>>>>>>> [conte-a084:51113] Signal: Segmentation fault (11)
>>>>>>> [conte-a084:51113] Signal code: Address not mapped (1)
>>>>>>> [conte-a084:51113] Failing at address: 0x2c
>>>>>>> [conte-a084:51113] [ 0] /lib64/libpthread.so.0[0x36ddc0f710]
>>>>>>> [conte-a084:51113] [ 1]
>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2b857e203015]
>>>>>>> [conte-a084:51113] [ 2]
>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b857ee10715]
>>>>>>> [conte-a084:51113] [ 3] mpirun(orterun+0x1b45)[0x40684f]
>>>>>>> [conte-a084:51113] [ 4] mpirun(main+0x20)[0x4047f4]
>>>>>>> [conte-a084:51113] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x36dd41ed1d]
>>>>>>> [conte-a084:51113] [ 6] mpirun[0x404719]
>>>>>>> [conte-a084:51113] *** End of error message ***
>>>>>>> Segmentation fault (core dumped)
>>>>>>>
>>>>>>> On Sun, Jun 8, 2014 at 4:54 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>> I'm having no luck poking at this segfault issue. For some strange reason, we seem to think there are coprocessors on those remote nodes - e.g., a Phi card. Yet your lstopo output doesn't seem to show it.
>>>>>>>>
>>>>>>>> Out of curiosity, can you try running this with "-mca plm rsh"? This will substitute the rsh/ssh launcher in place of Torque - assuming your system will allow it, this will let me see if the problem is somewhere in the Torque launcher or elsewhere in OMPI.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On Jun 6, 2014, at 12:48 PM, Dan Dietz <ddietz_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> No problem -
>>>>>>>>>
>>>>>>>>> These are Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips:
>>>>>>>>> 2 per node, 8 cores each. No threading enabled.
>>>>>>>>>
>>>>>>>>> $ lstopo
>>>>>>>>> Machine (64GB)
>>>>>>>>> NUMANode L#0 (P#0 32GB)
>>>>>>>>> Socket L#0 + L3 L#0 (20MB)
>>>>>>>>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>>>>>>>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>>>>>>>>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>>>>>>>>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>>>>>>>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>>>>>>>>> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>>>>>>>>> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>>>>>>>>> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>>>>>>>>> HostBridge L#0
>>>>>>>>> PCIBridge
>>>>>>>>> PCI 1000:0087
>>>>>>>>> Block L#0 "sda"
>>>>>>>>> PCIBridge
>>>>>>>>> PCI 8086:2250
>>>>>>>>> PCIBridge
>>>>>>>>> PCI 8086:1521
>>>>>>>>> Net L#1 "eth0"
>>>>>>>>> PCI 8086:1521
>>>>>>>>> Net L#2 "eth1"
>>>>>>>>> PCIBridge
>>>>>>>>> PCI 102b:0533
>>>>>>>>> PCI 8086:1d02
>>>>>>>>> NUMANode L#1 (P#1 32GB)
>>>>>>>>> Socket L#1 + L3 L#1 (20MB)
>>>>>>>>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>>>>>>>>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>>>>>>>>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>>>>>>>>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>>>>>>>>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>>>>>>>>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>>>>>>>>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>>>>>>>>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>>>>>>>>> HostBridge L#5
>>>>>>>>> PCIBridge
>>>>>>>>> PCI 15b3:1011
>>>>>>>>> Net L#3 "ib0"
>>>>>>>>> OpenFabrics L#4 "mlx5_0"
>>>>>>>>> PCIBridge
>>>>>>>>> PCI 8086:2250
>>>>>>>>>
>>>>>>>>> The segfault output is below. I tried reproducing the crash on less
>>>>>>>>> than a 4-node allocation but wasn't able to.
>>>>>>>>>
>>>>>>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ mpirun -np 2
>>>>>>>>> -machinefile ./nodes -mca plm_base_verbose 10 ./hello
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> registering plm components
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> found loaded component isolated
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> component isolated has no register or open function
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> found loaded component slurm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> component slurm register function successful
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> found loaded component rsh
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> component rsh register function successful
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> found loaded component tm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register:
>>>>>>>>> component tm register function successful
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: opening
>>>>>>>>> plm components
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>>>>>>>>> loaded component isolated
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>>>>>>>>> component isolated open function successful
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>>>>>>>>> loaded component slurm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>>>>>>>>> component slurm open function successful
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>>>>>>>>> loaded component rsh
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>>>>>>>>> component rsh open function successful
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found
>>>>>>>>> loaded component tm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open:
>>>>>>>>> component tm open function successful
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select: Auto-selecting plm
>>>>>>>>> components
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>>>>>>>>> component [isolated]
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of
>>>>>>>>> component [isolated] set priority to 0
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>>>>>>>>> component [slurm]
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Skipping
>>>>>>>>> component [slurm]. Query failed to return a module
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>>>>>>>>> component [rsh]
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[INVALID],INVALID] plm:rsh_lookup
>>>>>>>>> on agent ssh : rsh path NULL
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of
>>>>>>>>> component [rsh] set priority to 10
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying
>>>>>>>>> component [tm]
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of
>>>>>>>>> component [tm] set priority to 75
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Selected
>>>>>>>>> component [tm]
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component isolated closed
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading
>>>>>>>>> component isolated
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component slurm closed
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading component slurm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component rsh closed
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading component rsh
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: initial bias
>>>>>>>>> 55685 nodename hash 3965217721
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: final jobfam 24164
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:receive start comm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_job
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm creating map
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm add
>>>>>>>>> new daemon [[24164,0],1]
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm
>>>>>>>>> assigning new daemon [[24164,0],1] to node conte-a055
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching vm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: final top-level argv:
>>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid
>>>>>>>>> <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri
>>>>>>>>> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380"
>>>>>>>>> -mca plm_base_verbose 10
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting
>>>>>>>>> LD_LIBRARY_PATH:
>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/usr/pbs/lib:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mpirt/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/ipp/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/tbb/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/lib64
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting
>>>>>>>>> PATH: /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/bin/intel64:/opt/intel/mic/bin:/apps/rhel6/intel/inspector_xe_2013/bin64:/apps/rhel6/intel/advisor_xe_2013/bin64:/apps/rhel6/intel/vtune_amplifier_xe_2013/bin64:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/bin:/usr/lib64/qt-3.3/bin:/opt/moab/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/hpss/bin:/opt/hsi/bin:/opt/ibutils/bin:/usr/pbs/bin:/opt/moab/bin:/usr/site/rcac/scripts:/usr/site/rcac/support_scripts:/usr/site/rcac/bin:/usr/site/rcac/sbin:/usr/sbin
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching on
>>>>>>>>> node conte-a055
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: executing:
>>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid 1
>>>>>>>>> -mca orte_ess_num_procs 2 -mca orte_hnp_uri
>>>>>>>>> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380"
>>>>>>>>> -mca plm_base_verbose 10
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm:launch:
>>>>>>>>> finished spawning orteds
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register:
>>>>>>>>> registering plm components
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register:
>>>>>>>>> found loaded component rsh
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register:
>>>>>>>>> component rsh register function successful
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: opening
>>>>>>>>> plm components
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: found
>>>>>>>>> loaded component rsh
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open:
>>>>>>>>> component rsh open function successful
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select: Auto-selecting plm
>>>>>>>>> components
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Querying
>>>>>>>>> component [rsh]
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_lookup on
>>>>>>>>> agent ssh : rsh path NULL
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Query of
>>>>>>>>> component [rsh] set priority to 10
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Selected
>>>>>>>>> component [rsh]
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_setup on
>>>>>>>>> agent ssh : rsh path NULL
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive start comm
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0]
>>>>>>>>> plm:base:orted_report_launch from daemon [[24164,0],1]
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0]
>>>>>>>>> plm:base:orted_report_launch from daemon [[24164,0],1] on node
>>>>>>>>> conte-a055
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] RECEIVED TOPOLOGY
>>>>>>>>> FROM NODE conte-a055
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] NEW TOPOLOGY - ADDING
>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0]
>>>>>>>>> plm:base:orted_report_launch completed for daemon [[24164,0],1] at
>>>>>>>>> contact 1583611904.1;tcp://172.18.96.95,172.31.1.254,172.31.2.254,172.18.112.95:58312
>>>>>>>>> [conte-a009:55685] *** Process received signal ***
>>>>>>>>> [conte-a009:55685] Signal: Segmentation fault (11)
>>>>>>>>> [conte-a009:55685] Signal code: Address not mapped (1)
>>>>>>>>> [conte-a009:55685] Failing at address: 0x4c
>>>>>>>>> [conte-a009:55685] [ 0] /lib64/libpthread.so.0[0x327f80f500]
>>>>>>>>> [conte-a009:55685] [ 1]
>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x951)[0x2b5b069a50e1]
>>>>>>>>> [conte-a009:55685] [ 2]
>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b5b075ff145]
>>>>>>>>> [conte-a009:55685] [ 3] mpirun(orterun+0x1ffd)[0x4073b5]
>>>>>>>>> [conte-a009:55685] [ 4] mpirun(main+0x20)[0x4048f4]
>>>>>>>>> [conte-a009:55685] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x327f41ecdd]
>>>>>>>>> [conte-a009:55685] [ 6] mpirun[0x404819]
>>>>>>>>> [conte-a009:55685] *** End of error message ***
>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$
>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive stop
>>>>>>>>> comm
>>>>>>>>>
>>>>>>>>> On Fri, Jun 6, 2014 at 3:00 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>> Sorry to pester with questions, but I'm trying to narrow down the issue.
>>>>>>>>>>
>>>>>>>>>> * What kind of chips are on these machines?
>>>>>>>>>>
>>>>>>>>>> * If they have h/w threads, are they enabled?
>>>>>>>>>>
>>>>>>>>>> * You might have lstopo on one of those machines - could you pass along its output? Otherwise, you can run a simple "mpirun -n 1 -mca ess_base_verbose 20 hostname" and it will print out. You only need one node in your allocation, as we don't need a fountain of output.
>>>>>>>>>>
>>>>>>>>>> I'll look into the segfault - hard to understand offhand, but could be an uninitialized variable. If you have a chance, could you rerun that test with "-mca plm_base_verbose 10" on the cmd line?
>>>>>>>>>>
>>>>>>>>>> Thanks again
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>> On Jun 6, 2014, at 10:31 AM, Dan Dietz <ddietz_at_[hidden]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply. I tried out the --display-allocation option with
>>>>>>>>>>> several different combinations and have attached the output. I see
>>>>>>>>>>> this behavior on RHEL6.4, RHEL6.5, and RHEL5.10 clusters.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here's debugging info on the segfault. Does that help? FWIW this does
>>>>>>>>>>> not seem to crash on the RHEL5 or RHEL6.5 clusters; it only
>>>>>>>>>>> crashes on RHEL6.4.
>>>>>>>>>>>
>>>>>>>>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ gdb -c core.22623
>>>>>>>>>>> `which mpirun`
>>>>>>>>>>> No symbol table is loaded. Use the "file" command.
>>>>>>>>>>> GNU gdb (GDB) 7.5-1.3.187
>>>>>>>>>>> Copyright (C) 2012 Free Software Foundation, Inc.
>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>>>>>>>>>> This is free software: you are free to change and redistribute it.
>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>>>>>>>>>>> and "show warranty" for details.
>>>>>>>>>>> This GDB was configured as "x86_64-unknown-linux-gnu".
>>>>>>>>>>> For bug reporting instructions, please see:
>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>...
>>>>>>>>>>> Reading symbols from
>>>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin/mpirun...done.
>>>>>>>>>>> [New LWP 22623]
>>>>>>>>>>> [New LWP 22624]
>>>>>>>>>>>
>>>>>>>>>>> warning: Can't read pathname for load map: Input/output error.
>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>>>>>>>>> Core was generated by `mpirun -np 2 -machinefile ./nodes ./hello'.
>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1,
>>>>>>>>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422
>>>>>>>>>>> 422 node->hostid = node->daemon->name.vpid;
>>>>>>>>>>> (gdb) bt
>>>>>>>>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1,
>>>>>>>>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422
>>>>>>>>>>> #1 0x00002acc60eec145 in opal_libevent2021_event_base_loop () from
>>>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6
>>>>>>>>>>> #2 0x00000000004073b5 in orterun (argc=6, argv=0x7fff5bb2a3a8) at
>>>>>>>>>>> orterun.c:1077
>>>>>>>>>>> #3 0x00000000004048f4 in main (argc=6, argv=0x7fff5bb2a3a8) at main.c:13
>>>>>>>>>>>
>>>>>>>>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ cat nodes
>>>>>>>>>>> conte-a009
>>>>>>>>>>> conte-a009
>>>>>>>>>>> conte-a055
>>>>>>>>>>> conte-a055
>>>>>>>>>>> ddietz_at_conte-a009:/scratch/conte/d/ddietz/hello$ uname -r
>>>>>>>>>>> 2.6.32-358.14.1.el6.x86_64
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 5, 2014 at 7:54 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 5, 2014, at 2:13 PM, Dan Dietz <ddietz_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP
>>>>>>>>>>>>> codes. In OMPI 1.6.3, I can do:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
>>>>>>>>>>>>>
>>>>>>>>>>>>> I get one rank bound to procs 0-7 and the other bound to 8-15. Great!
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I'm having some difficulties doing this with openmpi 1.8.1:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> The following command line options and corresponding MCA parameter have
>>>>>>>>>>>>> been deprecated and replaced as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Command line options:
>>>>>>>>>>>>> Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank,
>>>>>>>>>>>>> -cpus-per-rank
>>>>>>>>>>>>> Replacement: --map-by <obj>:PE=N
>>>>>>>>>>>>>
>>>>>>>>>>>>> Equivalent MCA parameter:
>>>>>>>>>>>>> Deprecated: rmaps_base_cpus_per_proc
>>>>>>>>>>>>> Replacement: rmaps_base_mapping_policy=<obj>:PE=N
>>>>>>>>>>>>>
>>>>>>>>>>>>> The deprecated forms *will* disappear in a future version of Open MPI.
>>>>>>>>>>>>> Please update to the new syntax.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> There are not enough slots available in the system to satisfy the 2 slots
>>>>>>>>>>>>> that were requested by the application:
>>>>>>>>>>>>> ./hello
>>>>>>>>>>>>>
>>>>>>>>>>>>> Either request fewer slots for your application, or make more slots available
>>>>>>>>>>>>> for use.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, let me try the new syntax...
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ mpirun -np 2 --map-by core:pe=8 -machinefile ./nodes ./hello
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> There are not enough slots available in the system to satisfy the 2 slots
>>>>>>>>>>>>> that were requested by the application:
>>>>>>>>>>>>> ./hello
>>>>>>>>>>>>>
>>>>>>>>>>>>> Either request fewer slots for your application, or make more slots available
>>>>>>>>>>>>> for use.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> What am I doing wrong? The documentation on these new options is
>>>>>>>>>>>>> somewhat poor and confusing so I'm probably doing something wrong. If
>>>>>>>>>>>>> anyone could provide some pointers here it'd be much appreciated! If
>>>>>>>>>>>>> it's not something simple and you need config logs and such please let
>>>>>>>>>>>>> me know.
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like we think there are fewer than 16 slots allocated on that node. What is in this "nodes" file? Without it, OMPI should read the Torque allocation directly. You might check what we think the allocation is by adding --display-allocation to the cmd line.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> As a side note -
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I try this using the PBS nodefile with the above, I get a confusing message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> A request for multiple cpus-per-proc was given, but a directive
>>>>>>>>>>>>> was also give to map to an object level that has less cpus than
>>>>>>>>>>>>> requested ones:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #cpus-per-proc: 8
>>>>>>>>>>>>> number of cpus: 1
>>>>>>>>>>>>> map-by: BYCORE:NOOVERSUBSCRIBE
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please specify a mapping level that has more cpus, or else let us
>>>>>>>>>>>>> define a default mapping that will allow multiple cpus-per-proc.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> From what I've gathered, this is because I have a node listed 16 times
>>>>>>>>>>>>> in my PBS nodefile, so it's assuming I have 1 core per node?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> No - if listed 16 times, it should compute 16 slots. Try adding --display-allocation to your cmd line and it should tell you how many slots are present.
>>>>>>>>>>>>
>>>>>>>>>>>> However, it doesn't assume there is a core for each slot. Instead, it detects the cores directly on the node. It sounds like it isn't seeing them for some reason. What OS are you running on that node?
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW: the 1.6 series has a different detection system for cores. Could be something is causing problems for the new one.
>>>>>>>>>>>>
>>>>>>>>>>>>> Some
>>>>>>>>>>>>> better documentation here would be helpful. I haven't been able to
>>>>>>>>>>>>> figure out how to use the "oversubscribe" option listed in the docs.
>>>>>>>>>>>>> Not that I really want to oversubscribe, of course; I just need to
>>>>>>>>>>>>> modify the nodefile. But this stumped me for a while, since 1.6.3
>>>>>>>>>>>>> didn't have this behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> As an extra bonus, I get a segfault in this situation:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ mpirun -np 2 -machinefile ./nodes ./hello
>>>>>>>>>>>>> [conte-a497:13255] *** Process received signal ***
>>>>>>>>>>>>> [conte-a497:13255] Signal: Segmentation fault (11)
>>>>>>>>>>>>> [conte-a497:13255] Signal code: Address not mapped (1)
>>>>>>>>>>>>> [conte-a497:13255] Failing at address: 0x2c
>>>>>>>>>>>>> [conte-a497:13255] [ 0] /lib64/libpthread.so.0[0x3c9460f500]
>>>>>>>>>>>>> [conte-a497:13255] [ 1]
>>>>>>>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2ba960a59015]
>>>>>>>>>>>>> [conte-a497:13255] [ 2]
>>>>>>>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2ba961666715]
>>>>>>>>>>>>> [conte-a497:13255] [ 3] mpirun(orterun+0x1b45)[0x40684f]
>>>>>>>>>>>>> [conte-a497:13255] [ 4] mpirun(main+0x20)[0x4047f4]
>>>>>>>>>>>>> [conte-a497:13255] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3a1bc1ecdd]
>>>>>>>>>>>>> [conte-a497:13255] [ 6] mpirun[0x404719]
>>>>>>>>>>>>> [conte-a497:13255] *** End of error message ***
>>>>>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Huh - that's odd. Could you perhaps configure OMPI with --enable-debug and gdb the core file to tell us the line numbers involved?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> My "nodes" file simply contains the first two lines of my original
>>>>>>>>>>>>> $PBS_NODEFILE provided by Torque. See above for why I modified it. It
>>>>>>>>>>>>> works fine if I use the full file.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance for any pointers you all may have!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Dan Dietz
>>>>>>>>>>>>> Scientific Applications Analyst
>>>>>>>>>>>>> ITaP Research Computing, Purdue University
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>

-- 
Dan Dietz
Scientific Applications Analyst
ITaP Research Computing, Purdue University