
Subject: Re: [OMPI users] Openmpi not using IB and no warning message
From: Sangamesh B (forum.san_at_[hidden])
Date: 2009-10-12 03:38:15


Any hints on my previous mail?

Does Open MPI 1.3.3 support only certain versions of OFED, or is any version OK?
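
To see which verbs stack is actually on the nodes, something like this could
be checked (ofed_info is only present if the OFED bundle itself was installed;
the rpm query assumes an RPM-based install, as Rocks normally uses):

  ofed_info | head -1
  rpm -q libibverbs libibverbs-devel
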
On Sun, Oct 11, 2009 at 3:55 PM, Sangamesh B <forum.san_at_[hidden]> wrote:

> Hi,
>
> A Fortran application is built with Intel Fortran 10.1, MKL 10, and
> Open MPI 1.3.3 on a Rocks 5.1 HPC Linux cluster. The jobs do not scale
> when more than one node is used. Each node has two quad-core Intel Xeon
> E5472 processors @ 3.00 GHz (8 cores and 16 GB RAM per node), and the
> nodes are connected with InfiniBand.
>
> Here are some of the timings:
>
> 12 cores (Node 1: 8 cores, Node 2: 4 cores)       -- No progress in the job
> 8 cores  (Node 1: 8 cores)                        -- 21 hours (38 CG move steps)
> 4 cores  (Node 1: 4 cores)                        -- 25 hours
> 12 cores (Node 1, Node 2, Node 3: 4 cores each)   -- No progress
>
>
> Later, to check whether Open MPI is using IB or not, I ran with --mca btl
> openib. But the job failed with the following error message:
> # cat /home1/g03/apps_test/amber/test16/err.352.job16
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[23671,1],12]) is on host: compute-0-12.local
> Process 2 ([[23671,1],12]) is on host: compute-0-12.local
> BTLs attempted: openib
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-12.local:5496] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6916] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6914] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6915] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> [compute-0-5.local:6913] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> mpirun has exited due to process rank 12 with PID 5496 on
> node compute-0-12.local exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [compute-0-5.local:06910] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[23958,1],2]) is on host: compute-0-5.local
> Process 2 ([[23958,1],2]) is on host: compute-0-5.local
> BTLs attempted: openib
>
> Then I added 'self', i.e. --mca btl openib,self. With this it started
> running, but I am quite sure it is not using IB, as observed with the
> netstat -i command.
>
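> The full launch was something like this (the host file name here is a
> placeholder, not the exact job script value):
>
>   mpirun --mca btl openib,self -np 12 -hostfile ./hosts /opt/apps/siesta/siesta_mpi
>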
> 1st Snap:
>
> Every 2.0s: netstat -i                               Sun Oct 11 15:29:29 2009
>
> Kernel Interface table
> Iface    MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0    1500   0  1847619      0      0      0  2073010      0      0      0 BMRU
> ib0    65520   0      708      0      0      0      509      0      5      0 BMRU
> lo     16436   0     5731      0      0      0     5731      0      0      0 LRU
>
> 2nd Snap:
>
> Every 2.0s: netstat -i                               Sun Oct 11 15:29:57 2009
>
> Kernel Interface table
> Iface    MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0    1500   0  1847647      0      0      0  2073073      0      0      0 BMRU
> ib0    65520   0      708      0      0      0      509      0      5      0 BMRU
> lo     16436   0     5731      0      0      0     5731      0      0      0 LRU
>
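> To cross-check at the verbs level as well, the HCA port counters can be
> watched directly (the device name mlx4_0 and the port number below are only
> an example; adjust to the actual adapter):
>
>   cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
>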
> Why is Open MPI not able to use IB?
>
> Running ldd on the executable shows that no IB libraries are linked. Is this
> the reason?
> ldd /opt/apps/siesta/siesta_mpi
>   /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_lp64.so (0x00002aaaaaaad000)
>   /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_thread.so (0x00002aaaaadc2000)
>   /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_core.so (0x00002aaaab2ad000)
>   libpthread.so.0 => /lib64/libpthread.so.0 (0x00000034a6200000)
>   libmpi_f90.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f90.so.0 (0x00002aaaab4a0000)
>   libmpi_f77.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f77.so.0 (0x00002aaaab6a3000)
>   libmpi.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi.so.0 (0x00002aaaab8db000)
>   libopen-rte.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-rte.so.0 (0x00002aaaabbaa000)
>   libopen-pal.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-pal.so.0 (0x00002aaaabe07000)
>   libdl.so.2 => /lib64/libdl.so.2 (0x00000034a5e00000)
>   libnsl.so.1 => /lib64/libnsl.so.1 (0x00000034a8200000)
>   libutil.so.1 => /lib64/libutil.so.1 (0x00000034a6600000)
>   libifport.so.5 => /opt/intel/fce/10.1.008/lib/libifport.so.5 (0x00002aaaac09a000)
>   libifcoremt.so.5 => /opt/intel/fce/10.1.008/lib/libifcoremt.so.5 (0x00002aaaac1d0000)
>   libimf.so => /opt/intel/cce/10.1.018/lib/libimf.so (0x00002aaaac401000)
>   libsvml.so => /opt/intel/cce/10.1.018/lib/libsvml.so (0x00002aaaac766000)
>   libm.so.6 => /lib64/libm.so.6 (0x00000034a6e00000)
>   libguide.so => /opt/intel/mkl/10.0.5.025/lib/em64t/libguide.so (0x00002aaaac8f1000)
>   libintlc.so.5 => /opt/intel/cce/10.1.018/lib/libintlc.so.5 (0x00002aaaaca65000)
>   libc.so.6 => /lib64/libc.so.6 (0x00000034a5a00000)
>   libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000034a7e00000)
>   /lib64/ld-linux-x86-64.so.2 (0x00000034a5600000)
>
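> Since Open MPI normally builds its BTLs as dlopen'ed plugins, libibverbs
> would show up against the openib component rather than the application
> binary; a check along these lines (assuming the default component directory
> under the install prefix) might be more telling:
>
>   ldd /opt/mpi/openmpi/1.3.3/intel/lib/openmpi/mca_btl_openib.so | grep ibverbs
>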
> With the help of the Open MPI FAQ:
>
> # /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info --param btl openib
>   MCA btl: parameter "btl_base_verbose" (current value: "0", data source: default value)
>            Verbosity level of the BTL framework
>   MCA btl: parameter "btl" (current value: <none>, data source: default value)
>            Default selection set of components for the btl framework (<none> means use all components that can be found)
>   MCA btl: parameter "btl_openib_verbose" (current value: "0", data source: default value)
>            Output some verbose OpenIB BTL information (0 = no output, nonzero = output)
>   MCA btl: parameter "btl_openib_warn_no_device_params_found" (current value: "1", data source: default value, synonyms: btl_openib_warn_no_hca_params_found)
>            Warn when no device-specific parameters are found in the INI file specified by the btl_openib_device_param_files MCA parameter (0 = do not warn; any other value = warn)
>   MCA btl: parameter "btl_openib_warn_no_hca_params_found" (current value: "1", data source: default value, deprecated, synonym of: btl_openib_warn_no_device_params_found)
>            Warn when no device-specific parameters are found in the INI file specified by the btl_openib_device_param_files MCA parameter (0 = do not warn; any other value = warn)
>   MCA btl: parameter "btl_openib_warn_default_gid_prefix" (current value: "1", data source: default value)
>            Warn when there is more than one active ports and at least one of them connected to the network with only default GID prefix configured (0 = do not warn; any other value = warn)
>   MCA btl: parameter "btl_openib_warn_nonexistent_if" (current value: "1", data source: default value)
>            Warn if non-existent devices and/or ports are specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not warn; any other value = warn)
>
> During the Open MPI install I used --with-openib=/usr, so I believe it is
> compiled with IB support.
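>
> One way to confirm that the openib BTL component was actually built and
> installed would be something like this (the exact version strings in the
> output may differ):
>
>   /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info | grep openib
>
> which should list the openib component among the MCA btl entries.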
>
> The IB utilities such as ibv_rc_pingpong are working fine.
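>
> The link state itself can also be sanity-checked with ibv_devinfo from
> libibverbs (the exact device shown will depend on the hardware):
>
>   ibv_devinfo | grep -E 'hca_id|state'
>
> which should report PORT_ACTIVE for the port in use.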
>
> I do not understand why Open MPI is not using IB. Please help me resolve
> this issue.
>
> Thanks
>