Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] init of component openib returned failure
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-05-18 17:40:36


Try running with:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 --mca btl_openib_verbose 100 ./IMB-MPI1 -npmin 2 PingPong

Also, are you saying that running the same command line with osu_latency works just fine? That would be really weird...
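
Once the extra verbosity is on, the output gets long quickly. A small script can pull out which BTL components failed to initialize on each host. The following is only a minimal sketch: it assumes the "select: init of component ... returned ..." line format seen in the quoted output below, and the function name summarize_btl_init is made up for illustration, not part of Open MPI.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [beo-15:20933] select: init of component openib returned failure
# The pattern is an assumption based on the verbose output quoted in this thread.
INIT_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+select: init of component "
    r"(?P<component>\S+) returned (?P<status>success|failure)"
)


def summarize_btl_init(lines):
    """Return {host: {component: status}} for every init result found."""
    summary = defaultdict(dict)
    for line in lines:
        # search() rather than match(), so a leading "> " quote prefix is harmless.
        found = INIT_RE.search(line)
        if found:
            summary[found["host"]][found["component"]] = found["status"]
    return dict(summary)


# Two lines copied from the quoted output, used here as sample input.
sample = [
    "[beo-15:20933] select: init of component openib returned failure",
    "[beo-15:20933] select: init of component self returned success",
]

for host, components in sorted(summarize_btl_init(sample).items()):
    for component, status in sorted(components.items()):
        print(f"{host}: {component} -> {status}")
```

To use it on a real run, capture the mpirun output to a file and pass the open file object to summarize_btl_init() in place of the sample list.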

On May 18, 2010, at 6:18 AM, Peter Kruse wrote:

> Hello,
>
> Trying to run the Intel MPI Benchmarks with Open MPI 1.4.1 fails while
> initializing the openib component. The system is Debian GNU/Linux 5.0.4.
> The command to start the job (under Torque 2.4.7) was:
>
> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2
> ./IMB-MPI1 -npmin 2 PingPong
>
> and results in these messages:
>
> ----------------------------8<----------------------------------------------
>
> [beo-15:20933] mca: base: components_open: Looking for btl components
> [beo-16:20605] mca: base: components_open: Looking for btl components
> [beo-15:20933] mca: base: components_open: opening btl components
> [beo-15:20933] mca: base: components_open: found loaded component openib
> [beo-15:20933] mca: base: components_open: component openib has no register
> function
> [beo-15:20933] mca: base: components_open: component openib open function
> successful
> [beo-15:20933] mca: base: components_open: found loaded component self
> [beo-15:20933] mca: base: components_open: component self has no register function
> [beo-15:20933] mca: base: components_open: component self open function successful
> [beo-16:20605] mca: base: components_open: opening btl components
> [beo-16:20605] mca: base: components_open: found loaded component openib
> [beo-16:20605] mca: base: components_open: component openib has no register
> function
> [beo-16:20605] mca: base: components_open: component openib open function
> successful
> [beo-16:20605] mca: base: components_open: found loaded component self
> [beo-16:20605] mca: base: components_open: component self has no register function
> [beo-16:20605] mca: base: components_open: component self open function successful
> [beo-15:20933] select: initializing btl component openib
> [beo-15:20933] select: init of component openib returned failure
> [beo-15:20933] select: module openib unloaded
> [beo-15:20933] select: initializing btl component self
> [beo-15:20933] select: init of component self returned success
> [beo-16:20605] select: initializing btl component openib
> [beo-16:20605] select: init of component openib returned failure
> [beo-16:20605] select: module openib unloaded
> [beo-16:20605] select: initializing btl component self
> [beo-16:20605] select: init of component self returned success
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[4887,1],0]) is on host: beo-15
> Process 2 ([[4887,1],1]) is on host: beo-16
> BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init_thread
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [beo-15:20933] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> orterun has exited due to process rank 0 with PID 20933 on
> node beo-15 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by orterun (as reported here).
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init_thread
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [beo-16:20605] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> [beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt /
> unreachable proc
> [beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [beo-15:20930] 1 more process has sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
>
> ----------------------------8<----------------------------------------------
>
> Running another benchmark (OSU) succeeds in loading the openib component.
>
> "ibstat |grep -i state" on both nodes gives:
>
> ----------------------------8<----------------------------------------------
> State: Active
> Physical state: LinkUp
> ----------------------------8<----------------------------------------------
>
> Running with "mpi_abort_delay -1" and attaching strace to the process
> is not very helpful; it just loops with:
>
> ----------------------------8<----------------------------------------------
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART,
> 0x2aee59d44f60}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> nanosleep({5, 0}, {5, 0}) = 0
> ----------------------------8<----------------------------------------------
>
> Does anybody have an idea what is wrong, or how we can get more debugging
> information about the initialization of the openib module?
>
> Thanks for any help,
>
> Peter

-- 
Jeff Squyres
jsquyres_at_[hidden]