Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] init of component openib returned failure
From: Peter Kruse (pk_at_[hidden])
Date: 2010-05-19 03:45:58


Hello,

Thanks for your reply.

Jeff Squyres wrote:
> Try running with:
>
> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 --mca btl_openib_verbose 100 ./IMB-MPI1 -npmin 2 PingPong

The output is exactly the same as before.

>
> Also, are you saying that running the same command line with osu_latency works just fine? That would be really weird...

Yes, if I run:

mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 --mca btl_openib_verbose 100 ./osu_lat_ompi-1.4.1

the openib component can be initialized:

----------------------------8<----------------------------------------------

[beo-15:29479] mca: base: components_open: Looking for btl components
[beo-16:29063] mca: base: components_open: Looking for btl components
[beo-15:29479] mca: base: components_open: opening btl components
[beo-15:29479] mca: base: components_open: found loaded component openib
[beo-15:29479] mca: base: components_open: component openib has no register function
[beo-15:29479] mca: base: components_open: component openib open function successful
[beo-15:29479] mca: base: components_open: found loaded component self
[beo-15:29479] mca: base: components_open: component self has no register function
[beo-15:29479] mca: base: components_open: component self open function successful
[beo-16:29063] mca: base: components_open: opening btl components
[beo-16:29063] mca: base: components_open: found loaded component openib
[beo-16:29063] mca: base: components_open: component openib has no register function
[beo-16:29063] mca: base: components_open: component openib open function successful
[beo-16:29063] mca: base: components_open: found loaded component self
[beo-16:29063] mca: base: components_open: component self has no register function
[beo-16:29063] mca: base: components_open: component self open function successful
[beo-15:29479] select: initializing btl component openib
[beo-16:29063] select: initializing btl component openib
[beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 25204
[beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Sinai Infinihost III
[beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
[beo-15:29479] openib BTL: oob CPC available for use on mthca0:1
[beo-15:29479] openib BTL: xoob CPC only supported with XRC receive queues; skipped on mthca0:1
[beo-15:29479] openib BTL: rdmacm CPC available for use on mthca0:1
[beo-15:29479] select: init of component openib returned success
[beo-15:29479] select: initializing btl component self
[beo-15:29479] select: init of component self returned success
[beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 25204
[beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Sinai Infinihost III
[beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
[beo-16:29063] openib BTL: oob CPC available for use on mthca0:1
[beo-16:29063] openib BTL: xoob CPC only supported with XRC receive queues; skipped on mthca0:1
[beo-16:29063] openib BTL: rdmacm CPC available for use on mthca0:1
[beo-16:29063] select: init of component openib returned success
[beo-16:29063] select: initializing btl component self
[beo-16:29063] select: init of component self returned success
# OSU MPI Latency Test (Version 2.2)
# Size Latency (us)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
[beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
0 3.57
1 3.65
2 3.63
4 3.64
8 3.68
16 3.72
32 3.77
64 3.95
128 4.95
256 5.36
512 6.03
1024 7.64
2048 9.95
4096 12.78
8192 18.22
16384 25.48
32768 37.03
65536 60.21
131072 107.90
262144 201.18
524288 389.08
1048576 762.38
2097152 1510.91
4194304 3005.72
[beo-15:29479] mca: base: close: component openib closed
[beo-16:29063] mca: base: close: component openib closed
[beo-16:29063] mca: base: close: unloading component openib
[beo-15:29479] mca: base: close: unloading component openib
[beo-16:29063] mca: base: close: component self closed
[beo-16:29063] mca: base: close: unloading component self
[beo-15:29479] mca: base: close: component self closed
[beo-15:29479] mca: base: close: unloading component self

----------------------------8<----------------------------------------------
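
When comparing runs like these, it can help to filter the verbose output down to the "select: init of component ... returned ..." lines and list the result per host. The following is only a minimal Python sketch: it assumes the line format shown in the logs in this thread, and the names summarize_btl_init and failed_components are made up for illustration; they are not part of Open MPI.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [beo-15:29479] select: init of component openib returned success
# search() is used below, so a leading quote marker (">> ") is tolerated.
INIT_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] select: init of component "
    r"(?P<component>\S+) returned (?P<result>success|failure)"
)


def summarize_btl_init(log_text):
    """Return {host: {component: 'success' | 'failure'}} from verbose output."""
    summary = defaultdict(dict)
    for line in log_text.splitlines():
        match = INIT_RE.search(line)
        if match:
            summary[match["host"]][match["component"]] = match["result"]
    return {host: dict(components) for host, components in summary.items()}


def failed_components(summary):
    """List (host, component) pairs whose init returned failure, sorted."""
    return sorted(
        (host, component)
        for host, components in summary.items()
        for component, result in components.items()
        if result == "failure"
    )


if __name__ == "__main__":
    # Sample lines copied from the output quoted in this thread.
    sample = """\
[beo-15:20933] select: initializing btl component openib
[beo-15:20933] select: init of component openib returned failure
[beo-15:20933] select: initializing btl component self
[beo-15:20933] select: init of component self returned success
[beo-16:29063] select: init of component openib returned success
"""
    for host, component in failed_components(summarize_btl_init(sample)):
        print(f"{host}: {component} failed to initialize")
```

Applied to the logs in this thread, this would flag openib on both beo-15 and beo-16 for the IMB-MPI1 run and report no failures for the OSU run.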

really weird.

   Peter

>
>
> On May 18, 2010, at 6:18 AM, Peter Kruse wrote:
>
>> Hello,
>>
>> Trying to run the Intel MPI Benchmarks with Open MPI 1.4.1 fails while
>> initializing the openib component. The system is Debian GNU/Linux 5.0.4.
>> The command to start the job (under Torque 2.4.7) was:
>>
>> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 ./IMB-MPI1 -npmin 2 PingPong
>>
>> and results in these messages:
>>
>> ----------------------------8<----------------------------------------------
>>
>> [beo-15:20933] mca: base: components_open: Looking for btl components
>> [beo-16:20605] mca: base: components_open: Looking for btl components
>> [beo-15:20933] mca: base: components_open: opening btl components
>> [beo-15:20933] mca: base: components_open: found loaded component openib
>> [beo-15:20933] mca: base: components_open: component openib has no register
>> function
>> [beo-15:20933] mca: base: components_open: component openib open function
>> successful
>> [beo-15:20933] mca: base: components_open: found loaded component self
>> [beo-15:20933] mca: base: components_open: component self has no register function
>> [beo-15:20933] mca: base: components_open: component self open function successful
>> [beo-16:20605] mca: base: components_open: opening btl components
>> [beo-16:20605] mca: base: components_open: found loaded component openib
>> [beo-16:20605] mca: base: components_open: component openib has no register
>> function
>> [beo-16:20605] mca: base: components_open: component openib open function
>> successful
>> [beo-16:20605] mca: base: components_open: found loaded component self
>> [beo-16:20605] mca: base: components_open: component self has no register function
>> [beo-16:20605] mca: base: components_open: component self open function successful
>> [beo-15:20933] select: initializing btl component openib
>> [beo-15:20933] select: init of component openib returned failure
>> [beo-15:20933] select: module openib unloaded
>> [beo-15:20933] select: initializing btl component self
>> [beo-15:20933] select: init of component self returned success
>> [beo-16:20605] select: initializing btl component openib
>> [beo-16:20605] select: init of component openib returned failure
>> [beo-16:20605] select: module openib unloaded
>> [beo-16:20605] select: initializing btl component self
>> [beo-16:20605] select: init of component self returned success
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[4887,1],0]) is on host: beo-15
>> Process 2 ([[4887,1],1]) is on host: beo-16
>> BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init_thread
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [beo-15:20933] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> orterun has exited due to process rank 0 with PID 20933 on
>> node beo-15 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by orterun (as reported here).
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init_thread
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [beo-16:20605] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> [beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt /
>> unreachable proc
>> [beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
>> help / error messages
>> [beo-15:20930] 1 more process has sent help message help-mpi-runtime /
>> mpi_init:startup:internal-failure
>>
>> ----------------------------8<----------------------------------------------
>>
>> Running another benchmark (OSU) succeeds in loading the openib component.
>>
>> "ibstat |grep -i state" on both nodes gives:
>>
>> ----------------------------8<----------------------------------------------
>> State: Active
>> Physical state: LinkUp
>> ----------------------------8<----------------------------------------------
>>
>> Running with "mpi_abort_delay -1" and attaching strace to the process
>> is not very helpful; it just loops with:
>>
>> ----------------------------8<----------------------------------------------
>> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
>> rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2aee59d44f60}, 8) = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> nanosleep({5, 0}, {5, 0}) = 0
>> ----------------------------8<----------------------------------------------
>>
>> Does anybody have an idea what is wrong, or how we can get more debugging
>> information about the initialization of the openib module?
>>
>> Thanks for any help,
>>
>> Peter