
Open MPI Development Mailing List Archives


Subject: [OMPI devel] rankfile syntax
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-07-23 13:19:02


Oh ye gods of rankfiles:

I have a node with two sockets, each with four cores. With a rankfile I
can bind to a specific core, a range of cores, or a specific core or
range of cores on a specific socket. What I'm having trouble with is
binding to all cores of a specific socket: Open MPI goes looking for
core 4 on socket 0. I understand why it can't find that core (each
socket only has cores 0-3), but I don't understand why it's looking for
it in the first place. Bug? My error/misunderstanding? Here's what the
flight recorder black box says:

% cat rankfile
rank 0=saem9 slot=0:*
% mpirun -np 1 --host saem9 --rankfile rankfile --mca
paffinity_base_verbose 5 ./a.out
[saem9:20649] mca:base:select:(paffinity) Querying component [linux]
[saem9:20649] mca:base:select:(paffinity) Query of component [linux] set
priority to 10
[saem9:20649] mca:base:select:(paffinity) Selected component [linux]
[saem9:20650] mca:base:select:(paffinity) Querying component [linux]
[saem9:20650] mca:base:select:(paffinity) Query of component [linux] set
priority to 10
[saem9:20650] mca:base:select:(paffinity) Selected component [linux]
[saem9:20650] paffinity slot assignment: slot_list == 0:*
[saem9:20650] Rank 0: PAFFINITY cannot get physical core id for logical
core 4 in physical socket 0 (0)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  opal_paffinity_base_slot_list_set() returned an error
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[saem9:20650] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 20650 on
node saem9 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
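FWIW, here are the rankfile forms that do bind fine on this node (the
rank, host, and core numbers are just examples from my setup), plus the
explicit enumeration that I'd expect to be equivalent to 0:* on a
four-core socket:

```
# bind rank 0 to one specific core (physical core 1)
rank 0=saem9 slot=1

# bind rank 0 to a range of cores (cores 1 through 3)
rank 0=saem9 slot=1-3

# bind rank 0 to core 1 of socket 0
rank 0=saem9 slot=0:1

# bind rank 0 to cores 0-3 of socket 0 -- spelling out explicitly
# what I'd expect slot=0:* to mean on this box; this one works,
# while slot=0:* fails as shown above
rank 0=saem9 slot=0:0-3
```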