Subject: Re: [OMPI users] init of component openib returned failure
From: Peter Kruse (pk_at_[hidden])
Date: 2010-05-19 07:18:41


Hi,

Jeff Squyres (jsquyres) wrote:
> Ok, we've entered the Land of Really Weird - I've never seen a btl work with one mpi app and not another.
>
> Some q's:
>
> - are you running both apps on the same nodes?

yes, in fact I'm running them in the same interactive job.

> - is anything else running on the nodes at the same time (e.g., other mpi jobs using openfabrics)?

no, the nodes are reserved for testing this at the moment.

> - is the imb compiled for ompi 1.4.1?

yes it is.

> - can you run ldd on the apps to ensure they're linking to the same libmpi?

----------------------------8<----------------------------------------------

$ ldd IMB-MPI1
         linux-vdso.so.1 => (0x00007fff077ff000)
         libmpi.so.0 => /usr/lib/openmpi/1.4.1/gcc/lib/libmpi.so.0 (0x00002b9120a3a000)
         libopen-rte.so.0 => /usr/lib/openmpi/1.4.1/gcc/lib/libopen-rte.so.0 (0x00002b9120cf4000)
         libopen-pal.so.0 => /usr/lib/openmpi/1.4.1/gcc/lib/libopen-pal.so.0 (0x00002b9120f43000)
         libdl.so.2 => /lib/libdl.so.2 (0x00002b91211c6000)
         libnsl.so.1 => /lib/libnsl.so.1 (0x00002b91213ca000)
         libutil.so.1 => /lib/libutil.so.1 (0x00002b91215e2000)
         libm.so.6 => /lib/libm.so.6 (0x00002b91217e6000)
         libpthread.so.0 => /lib/libpthread.so.0 (0x00002b9121a69000)
         libc.so.6 => /lib/libc.so.6 (0x00002b9121c85000)
         /lib64/ld-linux-x86-64.so.2 (0x00002b912081d000)
$ cd ../../osu_benchmarks/
$ ldd osu_lat_ompi-1.4.1
         linux-vdso.so.1 => (0x00007ffff65ff000)
         libmpi.so.0 => /usr/lib/openmpi/1.4.1/gcc/lib/libmpi.so.0 (0x00002b4f69ec8000)
         libopen-rte.so.0 => /usr/lib/openmpi/1.4.1/gcc/lib/libopen-rte.so.0 (0x00002b4f6a182000)
         libopen-pal.so.0 => /usr/lib/openmpi/1.4.1/gcc/lib/libopen-pal.so.0 (0x00002b4f6a3d1000)
         libdl.so.2 => /lib/libdl.so.2 (0x00002b4f6a654000)
         libnsl.so.1 => /lib/libnsl.so.1 (0x00002b4f6a858000)
         libutil.so.1 => /lib/libutil.so.1 (0x00002b4f6aa70000)
         libm.so.6 => /lib/libm.so.6 (0x00002b4f6ac74000)
         libpthread.so.0 => /lib/libpthread.so.0 (0x00002b4f6aef7000)
         libc.so.6 => /lib/libc.so.6 (0x00002b4f6b113000)
         /lib64/ld-linux-x86-64.so.2 (0x00002b4f69cab000)

----------------------------8<----------------------------------------------
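
Both binaries resolve libmpi, libopen-rte, and libopen-pal from the same
/usr/lib/openmpi/1.4.1/gcc/lib tree. As an extra sanity check (a small sketch,
assuming bash; run from the IMB directory, with the per-run load addresses and
the vdso filtered out by the awk condition), the two dependency lists can be
diffed directly:

----------------------------8<----------------------------------------------
$ diff <(ldd IMB-MPI1 | awk '$2 == "=>" && NF == 4 {print $1, $3}' | sort) \
       <(ldd ../../osu_benchmarks/osu_lat_ompi-1.4.1 | awk '$2 == "=>" && NF == 4 {print $1, $3}' | sort)
----------------------------8<----------------------------------------------

Empty output would mean both binaries link against exactly the same libraries.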

>
> -jms
> Sent from my PDA. No type good.

thanks for taking the trouble to reply!

Peter

>
> ----- Original Message -----
> From: users-bounces_at_[hidden] <users-bounces_at_[hidden]>
> To: Open MPI Users <users_at_[hidden]>
> Sent: Wed May 19 02:45:58 2010
> Subject: Re: [OMPI users] init of component openib returned failure
>
> Hello,
>
> thanks for your reply.
>
> Jeff Squyres wrote:
>> Try running with:
>>
>> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 --mca btl_openib_verbose 100 ./IMB-MPI1 -npmin 2 PingPong
>
> the output is exactly the same as before.
>
>> Also, are you saying that running the same command line with osu_latency works just fine? That would be really weird...
>
> Yes, if I run:
>
> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 --mca btl_openib_verbose 100 ./osu_lat_ompi-1.4.1
>
> the openib component can be initialized:
>
> ----------------------------8<----------------------------------------------
>
> [beo-15:29479] mca: base: components_open: Looking for btl components
> [beo-16:29063] mca: base: components_open: Looking for btl components
> [beo-15:29479] mca: base: components_open: opening btl components
> [beo-15:29479] mca: base: components_open: found loaded component openib
> [beo-15:29479] mca: base: components_open: component openib has no register function
> [beo-15:29479] mca: base: components_open: component openib open function successful
> [beo-15:29479] mca: base: components_open: found loaded component self
> [beo-15:29479] mca: base: components_open: component self has no register function
> [beo-15:29479] mca: base: components_open: component self open function successful
> [beo-16:29063] mca: base: components_open: opening btl components
> [beo-16:29063] mca: base: components_open: found loaded component openib
> [beo-16:29063] mca: base: components_open: component openib has no register function
> [beo-16:29063] mca: base: components_open: component openib open function successful
> [beo-16:29063] mca: base: components_open: found loaded component self
> [beo-16:29063] mca: base: components_open: component self has no register function
> [beo-16:29063] mca: base: components_open: component self open function successful
> [beo-15:29479] select: initializing btl component openib
> [beo-16:29063] select: initializing btl component openib
> [beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 25204
> [beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Sinai Infinihost III
> [beo-15][[12785,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
> [beo-15][[12785,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
> [beo-15:29479] openib BTL: oob CPC available for use on mthca0:1
> [beo-15:29479] openib BTL: xoob CPC only supported with XRC receive queues; skipped on mthca0:1
> [beo-15:29479] openib BTL: rdmacm CPC available for use on mthca0:1
> [beo-15:29479] select: init of component openib returned success
> [beo-15:29479] select: initializing btl component self
> [beo-15:29479] select: init of component self returned success
> [beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 25204
> [beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Mellanox Sinai Infinihost III
> [beo-16][[12785,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
> [beo-16][[12785,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
> [beo-16:29063] openib BTL: oob CPC available for use on mthca0:1
> [beo-16:29063] openib BTL: xoob CPC only supported with XRC receive queues; skipped on mthca0:1
> [beo-16:29063] openib BTL: rdmacm CPC available for use on mthca0:1
> [beo-16:29063] select: init of component openib returned success
> [beo-16:29063] select: initializing btl component self
> [beo-16:29063] select: init of component self returned success
> # OSU MPI Latency Test (Version 2.2)
> # Size Latency (us)
> [beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> [beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> [beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> [beo-16][[12785,1],1][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> [beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> [beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> [beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> [beo-15][[12785,1],0][connect/btl_openib_connect_oob.c:313:qp_connect_all] Set MTU to IBV value 4 (2048 bytes)
> 0 3.57
> 1 3.65
> 2 3.63
> 4 3.64
> 8 3.68
> 16 3.72
> 32 3.77
> 64 3.95
> 128 4.95
> 256 5.36
> 512 6.03
> 1024 7.64
> 2048 9.95
> 4096 12.78
> 8192 18.22
> 16384 25.48
> 32768 37.03
> 65536 60.21
> 131072 107.90
> 262144 201.18
> 524288 389.08
> 1048576 762.38
> 2097152 1510.91
> 4194304 3005.72
> [beo-15:29479] mca: base: close: component openib closed
> [beo-16:29063] mca: base: close: component openib closed
> [beo-16:29063] mca: base: close: unloading component openib
> [beo-15:29479] mca: base: close: unloading component openib
> [beo-16:29063] mca: base: close: component self closed
> [beo-16:29063] mca: base: close: unloading component self
> [beo-15:29479] mca: base: close: component self closed
> [beo-15:29479] mca: base: close: unloading component self
>
>
> ----------------------------8<----------------------------------------------
>
> really weird.
>
> Peter
>
>>
>> On May 18, 2010, at 6:18 AM, Peter Kruse wrote:
>>
>>> Hello,
>>>
>>> trying to run the Intel MPI Benchmarks (IMB) with Open MPI 1.4.1 fails when
>>> initializing the openib component. The system is Debian GNU/Linux 5.0.4.
>>> The command to start the job (under Torque 2.4.7) was:
>>>
>>> mpirun.openmpi-1.4.1 --mca btl_base_verbose 50 --mca btl self,openib -n 2 ./IMB-MPI1 -npmin 2 PingPong
>>>
>>> and results in these messages:
>>>
>>> ----------------------------8<----------------------------------------------
>>>
>>> [beo-15:20933] mca: base: components_open: Looking for btl components
>>> [beo-16:20605] mca: base: components_open: Looking for btl components
>>> [beo-15:20933] mca: base: components_open: opening btl components
>>> [beo-15:20933] mca: base: components_open: found loaded component openib
>>> [beo-15:20933] mca: base: components_open: component openib has no register function
>>> [beo-15:20933] mca: base: components_open: component openib open function successful
>>> [beo-15:20933] mca: base: components_open: found loaded component self
>>> [beo-15:20933] mca: base: components_open: component self has no register function
>>> [beo-15:20933] mca: base: components_open: component self open function successful
>>> [beo-16:20605] mca: base: components_open: opening btl components
>>> [beo-16:20605] mca: base: components_open: found loaded component openib
>>> [beo-16:20605] mca: base: components_open: component openib has no register function
>>> [beo-16:20605] mca: base: components_open: component openib open function successful
>>> [beo-16:20605] mca: base: components_open: found loaded component self
>>> [beo-16:20605] mca: base: components_open: component self has no register function
>>> [beo-16:20605] mca: base: components_open: component self open function successful
>>> [beo-15:20933] select: initializing btl component openib
>>> [beo-15:20933] select: init of component openib returned failure
>>> [beo-15:20933] select: module openib unloaded
>>> [beo-15:20933] select: initializing btl component self
>>> [beo-15:20933] select: init of component self returned success
>>> [beo-16:20605] select: initializing btl component openib
>>> [beo-16:20605] select: init of component openib returned failure
>>> [beo-16:20605] select: module openib unloaded
>>> [beo-16:20605] select: initializing btl component self
>>> [beo-16:20605] select: init of component self returned success
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>> Process 1 ([[4887,1],0]) is on host: beo-15
>>> Process 2 ([[4887,1],1]) is on host: beo-16
>>> BTLs attempted: self
>>>
>>> Your MPI job is now going to abort; sorry.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems. This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>>
>>> PML add procs failed
>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>> --------------------------------------------------------------------------
>>> *** An error occurred in MPI_Init_thread
>>> *** before MPI was initialized
>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>> [beo-15:20933] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>> --------------------------------------------------------------------------
>>> orterun has exited due to process rank 0 with PID 20933 on
>>> node beo-15 exiting without calling "finalize". This may
>>> have caused other processes in the application to be
>>> terminated by signals sent by orterun (as reported here).
>>> --------------------------------------------------------------------------
>>> *** An error occurred in MPI_Init_thread
>>> *** before MPI was initialized
>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>> [beo-16:20605] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>> [beo-15:20930] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
>>> [beo-15:20930] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> [beo-15:20930] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>>>
>>> ----------------------------8<----------------------------------------------
>>>
>>> Running another benchmark (OSU) succeeds in loading the openib component.
>>>
>>> "ibstat |grep -i state" on both nodes gives:
>>>
>>> ----------------------------8<----------------------------------------------
>>> State: Active
>>> Physical state: LinkUp
>>> ----------------------------8<----------------------------------------------
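>>>
>>> For a second opinion on the HCA state, "ibv_devinfo" from libibverbs
>>> (assuming it is installed on the nodes) reports the port state together
>>> with the device and firmware details:
>>>
>>> ----------------------------8<----------------------------------------------
>>> $ ibv_devinfo | grep -Ei 'state|fw_ver'
>>> ----------------------------8<----------------------------------------------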
>>>
>>> Running with "mpi_abort_delay -1" and attaching strace to the process is
>>> not very helpful; it just loops with:
>>>
>>> ----------------------------8<----------------------------------------------
>>> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
>>> rt_sigaction(SIGCHLD, NULL, {0x2aee58ff3250, [CHLD], SA_RESTORER|SA_RESTART, 0x2aee59d44f60}, 8) = 0
>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> nanosleep({5, 0}, {5, 0}) = 0
>>> ----------------------------8<----------------------------------------------
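>>>
>>> Since "mpi_abort_delay -1" just leaves the process sleeping in that loop,
>>> attaching gdb instead of strace might show where the failure happened (a
>>> sketch, assuming gdb and debug symbols are available; 20933 is the rank-0
>>> PID from the log above):
>>>
>>> ----------------------------8<----------------------------------------------
>>> $ gdb -p 20933
>>> (gdb) thread apply all bt
>>> ----------------------------8<----------------------------------------------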
>>>
>>> Does anybody have an idea what is wrong, or how we can get more debugging
>>> information about the initialization of the openib module?
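>>>
>>> In case more knobs exist than I know of: "ompi_info --param btl openib"
>>> (assuming I have the 1.4.1 syntax right) should list every MCA parameter
>>> the openib component understands, e.g. to find further verbosity levels:
>>>
>>> ----------------------------8<----------------------------------------------
>>> $ ompi_info --param btl openib | grep -i verbose
>>> ----------------------------8<----------------------------------------------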
>>>
>>> Thanks for any help,
>>>
>>> Peter