That sounds strange; the locality is definitely being set in the code. Can you run it with -mca hwloc_base_verbose 5 --display-map? That should tell us where it thinks the processes are running and what locality it is recording.


On Jul 3, 2012, at 11:54 AM, Juan Antonio Rico Gallego wrote:

Hello everyone. Maybe you can help me:

I checked out revision r26725 from the developer trunk (Subversion) and configured it with:

../../onecopy/ompi-trunk/configure --prefix=/home/jarico/shared/packages/openmpi-cas-dbg --disable-shared --enable-static --enable-debug --enable-mem-profile --enable-mem-debug CFLAGS=-g

Compilation succeeds, but when I try to run on a shared-memory machine with the SM component:

/home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec --mca mca_base_verbose 100 --mca mca_coll_base_output 100 --mca coll sm,self --mca coll_sm_priority 99  -n 2 ./bcast

I get the error message:


--------------------------------------------------------------------------
Although some coll components are available on your system, none of
them said that they could be used for a new communicator.

This is extremely unusual -- either the "basic" or "self" components
should be able to be chosen for any communicator.  As such, this
likely means that something else is wrong (although you should double
check that the "basic" and "self" coll components are available on
your system -- check the output of the "ompi_info" command).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_coll_base_comm_select(MPI_COMM_WORLD) failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[Metropolis-01:15120] *** An error occurred in MPI_Init
[Metropolis-01:15120] *** reported by process [3914661889,0]
[Metropolis-01:15120] *** on a NULL communicator
[Metropolis-01:15120] *** Unknown error
[Metropolis-01:15120] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Metropolis-01:15120] ***    and potentially your MPI job)
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: Metropolis-01
  PID:        15120
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 15120 on
node Metropolis-01 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).

You can avoid this message by specifying -quiet on the mpiexec command line.

--------------------------------------------------------------------------
[Metropolis-01:15119] 1 more process has sent help message help-mca-coll-base / comm-select:none-available
[Metropolis-01:15119] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[Metropolis-01:15119] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
[Metropolis-01:15119] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[Metropolis-01:15119] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
[jarico@Metropolis-01 examples]$ 



The problem seems to be in selecting the SM component, apparently because of the locality of the processes: the mca_coll_sm_init_query function returns OMPI_ERR_NOT_AVAILABLE.
I remember that with an earlier revision (around r26206) I had to modify the ompi_proc_init() function slightly, adding the lines:

        } else {
            /* get the locality information */
            proc->proc_flags = orte_ess.proc_get_locality(&proc->proc_name);
            /* get the name of the node it is on */
            proc->proc_hostname = orte_ess.proc_get_hostname(&proc->proc_name);
        }


That was enough to make it run correctly. But the function has since changed and this code no longer works. I am not sure what I am doing wrong now.
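
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It assumes that the init query accepts the component only when at least one other process is flagged as being on the local node; that is my reading of the symptom, not a confirmed description of the trunk code. Every name in it (sketch_proc_t, SKETCH_PROC_ON_NODE, sketch_sm_init_query, the SKETCH_* return codes) is hypothetical and is not part of the Open MPI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical locality bit; stands in for whatever flag the runtime records. */
#define SKETCH_PROC_ON_NODE 0x1u

/* Hypothetical return codes, used by this sketch only. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_AVAILABLE = -1 };

/* Hypothetical, pared-down process descriptor. */
typedef struct {
    unsigned flags; /* locality bitmask; 0 means "never filled in" */
} sketch_proc_t;

/* True if the peer is recorded as sharing a node with the caller. */
bool sketch_on_local_node(const sketch_proc_t *p)
{
    return (p->flags & SKETCH_PROC_ON_NODE) != 0;
}

/*
 * Mimics an init-time availability query: the component is usable only
 * if some process other than the caller is recorded as node-local.
 */
int sketch_sm_init_query(const sketch_proc_t *procs, size_t nprocs,
                         size_t my_rank)
{
    assert(procs != NULL && my_rank < nprocs);

    for (size_t i = 0; i < nprocs; ++i) {
        if (i != my_rank && sketch_on_local_node(&procs[i])) {
            return SKETCH_SUCCESS;
        }
    }
    /* No peer carries the locality bit, so the component declines. */
    return SKETCH_ERR_NOT_AVAILABLE;
}
```

Under this assumption, two processes that really do share a node would still be rejected if their flags were never populated, which would match the behaviour I see with -n 2.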

Thanks for your time,
Juan A. Rico
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel