Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI and SLURM
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-01-12 11:29:30


Sadly, we incorrectly removed the grpcomm component required to make that work. I'm restoring it this weekend, and we will be issuing 1.6.4 shortly.

In the meantime, you can use the PMI support in its place. Just configure OMPI with --with-pmi=<path-to-slurm's-pmi.h> and you will be able to direct-launch your job.
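Something like this should do it (the install prefix and the PMI path below are just placeholders; point --with-pmi at wherever your SLURM installation keeps pmi.h):

  $ ./configure --prefix=/opt/openmpi-1.6.3 \
        --with-slurm \
        --with-pmi=/usr
  $ make all install

Then the direct launch from your example should work (depending on how your SLURM was built, you may also need srun's --mpi=pmi2 option):

  $ salloc -n 2 srun ./IMB-MPI1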

Sorry for the problem.

On Jan 12, 2013, at 7:32 AM, Beat Rubischon <beat_at_[hidden]> wrote:

> Hello!
>
> I'm currently trying to run OpenMPI 1.6.3 binaries directly under SLURM
> 2.5.1 [1]. OpenMPI is built using --with-slurm, and $SLURM_STEP_RESV_PORTS
> is successfully set by SLURM. Based on the error message I assume a
> shared library couldn't be found, but unfortunately I'm not able to find a
> failed stat() or open() in strace (a quick ompi_info check is shown below).
>
> [1] http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
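>
> For reference, ompi_info lists which grpcomm components were actually
> installed with a given build (illustrative invocation only; output omitted):
>
> $ ompi_info | grep grpcomm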
>
> It's probably a stupid mistake on my side. It drives me crazy, as I
> already had such setups working back in the early OpenMPI 1.5 days :-/
>
> Using mpirun works:
>
> [dalco@master imb]$ salloc -n 2 mpirun ./IMB-MPI1
> salloc: Granted job allocation 72
> #---------------------------------------------------
> # Intel (R) MPI Benchmark Suite V3.2.2, MPI-1 part
> ...
>
> Direct invocation fails:
>
> [dalco@master imb]$ salloc -n 2 srun ./IMB-MPI1
> salloc: Granted job allocation 74
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host: node30.cluster
> Framework: grpcomm
> Component: hier
> --------------------------------------------------------------------------
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> base/ess_base_std_app.c at line 93
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_grpcomm_base_open failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> ess_slurmd_module.c at line 385
> [node30.cluster:42203] [[74,1],0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 128
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [node30.cluster:42203] *** An error occurred in MPI_Init_thread
> [node30.cluster:42203] *** on a NULL communicator
> [node30.cluster:42203] *** Unknown error
> [node30.cluster:42203] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: node30.cluster
> PID: 42203
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A requested component was not found, or was unable to be opened. This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded). Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host: node30.cluster
> Framework: grpcomm
> Component: hier
> --------------------------------------------------------------------------
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> base/ess_base_std_app.c at line 93
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_grpcomm_base_open failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> ess_slurmd_module.c at line 385
> [node30.cluster:42204] [[74,1],1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 128
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: orte_init failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [node30.cluster:42204] *** An error occurred in MPI_Init_thread
> [node30.cluster:42204] *** on a NULL communicator
> [node30.cluster:42204] *** Unknown error
> [node30.cluster:42204] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: node30.cluster
> PID: 42204
> --------------------------------------------------------------------------
> srun: error: node30: tasks 0-1: Exited with exit code 1
> salloc: Relinquishing job allocation 74
> salloc: Job allocation 74 has been revoked.
>
> Thanks for any input!
> Beat
>
> --
> \|/ Beat Rubischon <beat_at_[hidden]>
> ( 0-0 ) http://www.0x1b.ch/~beat/
> oOO--(_)--OOo---------------------------------------------------
> My experiences, thoughts, and dreams: http://www.0x1b.ch/blog/