
Open MPI User's Mailing List Archives


Subject: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
From: Dan Dietz (ddietz_at_[hidden])
Date: 2014-06-05 17:13:43


Hello all,

I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP
codes. In OMPI 1.6.3, I can do:

$ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello

I get one rank bound to procs 0-7 and the other bound to 8-15. Great!

But I'm having some difficulties doing this with openmpi 1.8.1:

$ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:

  Command line options:
    Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, -cpus-per-rank
    Replacement: --map-by <obj>:PE=N

  Equivalent MCA parameter:
    Deprecated: rmaps_base_cpus_per_proc
    Replacement: rmaps_base_mapping_policy=<obj>:PE=N

The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  ./hello

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

OK, let me try the new syntax...

$ mpirun -np 2 --map-by core:pe=8 -machinefile ./nodes ./hello
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  ./hello

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

What am I doing wrong? The documentation on these new options is
sparse and confusing, so I'm probably missing something. If anyone
could provide some pointers here, it'd be much appreciated! If it's
not something simple and you need config logs and such, please let
me know.

As a side note -

If I try this using the PBS nodefile with the above, I get a confusing message:

--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:

  #cpus-per-proc: 8
  number of cpus: 1
  map-by: BYCORE:NOOVERSUBSCRIBE

Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------

From what I've gathered, this is because the node is listed 16 times
in my PBS nodefile, so it then assumes I have 1 core per node? Some
better documentation here would be helpful. I haven't been able to
figure out how to use the "oversubscribe" option listed in the docs.
Not that I really want to oversubscribe, of course; I need to modify
the nodefile instead. But this stumped me for a while, as 1.6.3 didn't
have this behavior.
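
For reference, one way to modify the nodefile is to collapse the
repeated hostnames into one line per host with a slot count, using the
"hostname slots=N" hostfile syntax. The following is only a rough
sketch: the helper name collapse_nodefile and the second hostname
conte-a498 are made up for illustration, and whether a hostfile in this
form changes the 1.8.1 mapping behavior described above is an untested
assumption.

```python
from collections import Counter


def collapse_nodefile(lines):
    """Turn a PBS-style nodefile (one hostname per core) into
    'hostname slots=N' lines, keeping first-seen host order."""
    hosts = [line.strip() for line in lines if line.strip()]
    counts = Counter(hosts)  # preserves insertion order on Python 3.7+
    return [f"{host} slots={n}" for host, n in counts.items()]


if __name__ == "__main__":
    # Hypothetical nodefile contents: two nodes, each listed 16 times.
    example = ["conte-a497"] * 16 + ["conte-a498"] * 16
    print("\n".join(collapse_nodefile(example)))
```

The output could be written to a file and passed to mpirun with
-machinefile in place of the raw $PBS_NODEFILE.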

As an extra bonus, I get a segfault in this situation:

$ mpirun -np 2 -machinefile ./nodes ./hello
[conte-a497:13255] *** Process received signal ***
[conte-a497:13255] Signal: Segmentation fault (11)
[conte-a497:13255] Signal code: Address not mapped (1)
[conte-a497:13255] Failing at address: 0x2c
[conte-a497:13255] [ 0] /lib64/libpthread.so.0[0x3c9460f500]
[conte-a497:13255] [ 1] /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2ba960a59015]
[conte-a497:13255] [ 2] /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2ba961666715]
[conte-a497:13255] [ 3] mpirun(orterun+0x1b45)[0x40684f]
[conte-a497:13255] [ 4] mpirun(main+0x20)[0x4047f4]
[conte-a497:13255] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3a1bc1ecdd]
[conte-a497:13255] [ 6] mpirun[0x404719]
[conte-a497:13255] *** End of error message ***
Segmentation fault (core dumped)

My "nodes" file simply contains the first two lines of my original
$PBS_NODEFILE provided by Torque (see above for why I modified it).
It works fine if I use the full file.

Thanks in advance for any pointers you all may have!

Dan

-- 
Dan Dietz
Scientific Applications Analyst
ITaP Research Computing, Purdue University