Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM component init unload
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-07-03 14:28:57


Sounds strange - the locality is definitely being set in the code. Can you run it with -mca hwloc_base_verbose 5 --display-map? That should tell us where it thinks things are running and what locality it is recording.
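
For example, adding those two options to your mpiexec line from below would look something like (I dropped the other verbose flags to keep the output readable):

/home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec -mca hwloc_base_verbose 5 --display-map --mca coll sm,self --mca coll_sm_priority 99 -n 2 ./bcast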

On Jul 3, 2012, at 11:54 AM, Juan Antonio Rico Gallego wrote:

> Hello everyone. Maybe you can help me:
>
> I checked out Subversion revision r26725 from the developer trunk. I configured it with:
>
> ../../onecopy/ompi-trunk/configure --prefix=/home/jarico/shared/packages/openmpi-cas-dbg --disable-shared --enable-static --enable-debug --enable-mem-profile --enable-mem-debug CFLAGS=-g
>
> It compiles fine, but when I try to run on a shared-memory machine with the SM component:
>
> /home/jarico/shared/packages/openmpi-cas-dbg/bin/mpiexec --mca mca_base_verbose 100 --mca mca_coll_base_output 100 --mca coll sm,self --mca coll_sm_priority 99 -n 2 ./bcast
>
> I get the error message:
>
>
> --------------------------------------------------------------------------
> Although some coll components are available on your system, none of
> them said that they could be used for a new communicator.
>
> This is extremely unusual -- either the "basic" or "self" components
> should be able to be chosen for any communicator. As such, this
> likely means that something else is wrong (although you should double
> check that the "basic" and "self" coll components are available on
> your system -- check the output of the "ompi_info" command).
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> mca_coll_base_comm_select(MPI_COMM_WORLD) failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [Metropolis-01:15120] *** An error occurred in MPI_Init
> [Metropolis-01:15120] *** reported by process [3914661889,0]
> [Metropolis-01:15120] *** on a NULL communicator
> [Metropolis-01:15120] *** Unknown error
> [Metropolis-01:15120] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [Metropolis-01:15120] *** and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: Metropolis-01
> PID: 15120
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 0 with PID 15120 on
> node Metropolis-01 exiting improperly. There are three reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
>
> You can avoid this message by specifying -quiet on the mpiexec command line.
>
> --------------------------------------------------------------------------
> [Metropolis-01:15119] 1 more process has sent help message help-mca-coll-base / comm-select:none-available
> [Metropolis-01:15119] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [Metropolis-01:15119] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
> [Metropolis-01:15119] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [Metropolis-01:15119] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
> [jarico_at_Metropolis-01 examples]$
>
>
>
> It seems to be a problem choosing the SM component because of the locality of the processes. The mca_coll_sm_init_query function returns OMPI_ERR_NOT_AVAILABLE.
> I remember that in previous revisions (around r26206) I needed to modify the ompi_proc_init() function slightly, adding the lines:
>
> } else {
>     /* get the locality information */
>     proc->proc_flags = orte_ess.proc_get_locality(&proc->proc_name);
>     /* get the name of the node it is on */
>     proc->proc_hostname = orte_ess.proc_get_hostname(&proc->proc_name);
> }
>
>
> which was enough to get it running correctly. But this function has changed, and that code no longer works. I am not sure what I am doing wrong now.
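>
> In case it clarifies what I mean, here is a minimal sketch of the kind of locality check I believe is failing. The helper name locality_check is mine, and I am assuming the OPAL_PROC_ON_LOCAL_NODE macro and the proc_flags field work as in the current trunk; this is not the actual component code:
>
> #include "ompi/communicator/communicator.h"
> #include "ompi/proc/proc.h"
> #include "opal/mca/hwloc/hwloc.h"
>
> /* Hypothetical sketch: fail unless every peer carries the "same node"
>    locality flag. If ompi_proc_init() never fills in proc_flags, a
>    check like this can never succeed. */
> static int locality_check(ompi_communicator_t *comm)
> {
>     int i;
>     for (i = 0; i < ompi_comm_size(comm); ++i) {
>         ompi_proc_t *proc = ompi_comm_peer_lookup(comm, i);
>         if (!OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
>             return OMPI_ERR_NOT_AVAILABLE;
>         }
>     }
>     return OMPI_SUCCESS;
> }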
>
> Thanks for your time,
> Juan A. Rico
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel