
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] rankfile syntax
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-23 13:32:36


Found it - a simple "<=" instead of "<".

Test compiling now - should be in trunk shortly.

Thanks, oh test pilot!
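
For the archives, here is a minimal, self-contained sketch of that class of
off-by-one. The function and macro names are illustrative, not the actual
paffinity code in the trunk; the four-cores-per-socket count matches Eugene's
node as described below.

    /* Illustrative only -- not the real Open MPI paffinity code.
     * With CORES_PER_SOCKET == 4, looping with "<=" also asks for
     * logical core 4, which does not exist: the same lookup failure
     * shown in the log quoted below. */
    #include <stdio.h>

    #define CORES_PER_SOCKET 4

    /* Pretend lookup of a physical core id for a logical core on a socket. */
    static int get_physical_core(int socket, int logical_core)
    {
        if (logical_core >= CORES_PER_SOCKET) {
            fprintf(stderr, "cannot get physical core id for logical core %d "
                    "in physical socket %d\n", logical_core, socket);
            return -1;
        }
        return socket * CORES_PER_SOCKET + logical_core;
    }

    int main(void)
    {
        /* BUG: "<=" walks logical cores 0..4 instead of 0..3; the fix is "<". */
        for (int i = 0; i <= CORES_PER_SOCKET; i++) {
            if (get_physical_core(0, i) < 0) {
                return 1;
            }
        }
        return 0;
    }

Any C99 compiler will do; the loop's last iteration asks for logical core 4
and fails, which is the same complaint PAFFINITY makes in the log.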

On Jul 23, 2009, at 11:19 AM, Eugene Loh wrote:

> Oh ye gods of rankfiles:
>
> I have a node that has two sockets, each with four cores. If I use
> a rankfile, I can bind to a specific core, a specific range of
> cores, or a specific core or range of cores of a specific socket.
> I'm having trouble binding to all cores of a specific socket. It's
> looking for core 4 on socket 0. I understand why it can't find it,
> but I don't understand why it's looking for it. Bug? My error/
> misunderstanding? Here's what the flight recorder black box says:
>
>
> % cat rankfile
> rank 0=saem9 slot=0:*
> % mpirun -np 1 --host saem9 --rankfile rankfile --mca
> paffinity_base_verbose 5 ./a.out
> [saem9:20649] mca:base:select:(paffinity) Querying component [linux]
> [saem9:20649] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> [saem9:20649] mca:base:select:(paffinity) Selected component [linux]
> [saem9:20650] mca:base:select:(paffinity) Querying component [linux]
> [saem9:20650] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> [saem9:20650] mca:base:select:(paffinity) Selected component [linux]
> [saem9:20650] paffinity slot assignment: slot_list == 0:*
> [saem9:20650] Rank 0: PAFFINITY cannot get physical core id for
> logical core 4 in physical socket 0 (0)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process
> is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> opal_paffinity_base_slot_list_set() returned an error
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [saem9:20650] Abort before MPI_INIT completed successfully; not able
> to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 20650 on
> node saem9 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
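For reference, the rankfile binding forms Eugene describes at the top of his
message look roughly like this; the host names and slot numbers here are
illustrative, not taken from saem9:

    rank 0=aa slot=1:0-2
    rank 1=bb slot=0:0,1
    rank 2=cc slot=1-2

Here rank 0 would be bound to logical cores 0-2 of socket 1 on host aa,
rank 1 to cores 0 and 1 of socket 0 on bb, and rank 2 to logical cores 1-2
on cc with no socket qualifier. The failing case in the transcript uses the
remaining form, slot=0:*, i.e. all cores of socket 0, which is what tripped
the off-by-one.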