Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] errors trying to run a simple mpi task
From: Ralph Castain (rhc.openmpi_at_[hidden])
Date: 2013-06-23 10:58:01


Don't include udapl - that code may well be stale
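
If rebuilding isn't convenient right away, the udapl BTL can also be excluded at run time. A minimal sketch, assuming the installed 1.6.4 build keeps the component around:

    mpirun --mca btl ^udapl -n 1 hello

or, to make it persistent, put the equivalent line in an MCA parameter file such as $HOME/.openmpi/mca-params.conf:

    btl = ^udapl

Longer term, drop --with-udapl from the configure_options in the rpmbuild command below.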

Sent from my iPhone

On Jun 23, 2013, at 3:42 AM, dani <dani_at_[hidden]> wrote:

> Hi,
>
> I've encountered strange issues when trying to run a simple MPI job on a single host that has IB.
> The complete errors:
>
>> -> mpirun -n 1 hello
>> --------------------------------------------------------------------------
>> WARNING: Failed to open "ofa-v2-mlx4_0-1" [DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
>> This may be a real error or it may be an invalid entry in the uDAPL
>> Registry which is contained in the dat.conf file. Contact your local
>> System Administrator to confirm the availability of the interfaces in
>> the dat.conf file.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> [[53031,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: uDAPL
>> Host: n01
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered. You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>> See this Open MPI FAQ item for more information on these Linux kernel module
>> parameters:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> Local host: n01
>> Registerable memory: 32768 MiB
>> Total memory: 65503 MiB
>>
>> Your MPI job will continue, but may be behave poorly and/or hang.
>> --------------------------------------------------------------------------
>> Process 0 on n01 out of 1
>> [n01:13534] 7 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
>> [n01:13534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
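
The second warning ("Registerable memory: 32768 MiB" against 65503 MiB total) is separate from the uDAPL one. On mlx4 hardware the FAQ item above comes down to raising the mlx4_core MTT module parameters; registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page size. A sketch with illustrative values for a 64 GiB node (the file name and numbers are assumptions, not taken from your system):

    # e.g. /etc/modprobe.d/mlx4_core.conf
    options mlx4_core log_num_mtt=22 log_mtts_per_seg=3

then restart openibd (or reload mlx4_core) so the new values take effect; 2^22 * 2^3 * 4 KiB lets roughly 128 GiB be registered, i.e. about twice physical memory, as the FAQ suggests.
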
> Following is my setup and other info:
> OS: CentOS 6.3 x86_64
> installed OFED 3.5 from source (./install.pl --all)
> installed Open MPI 1.6.4 with the following build parameters:
>> rpmbuild --rebuild openmpi-1.6.4-1.src.rpm --define '_prefix /opt/openmpi/1.6.4/gcc' --define '_defaultdocdir /opt/openmpi/1.6.4/gcc' --define '_mandir %{_prefix}/share/man' --define '_datadir %{_prefix}/share' --define 'configure_options --with-openib=/usr --with-openib-libdir=/usr/lib64 CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpirun-prefix-by-default --target=x86_64-unknown-linux-gnu --with-hwloc=/usr/local --with-libltdl --enable-branch-probabilities --with-udapl --with-sge --disable-vt' --define 'use_default_rpm_opt_flags 1' --define '_name openmpi-1.6.4_gcc' --define 'install_shell_scripts 1' --define 'shell_scripts_basename mpivars' --define '_usr /usr' --define 'ofed 0' 2>&1 | tee openmpi.build.sge
> (--disable-vt was used because CUDA is present and gets linked in automatically by VT, making it a dependency with no matching rpm.)
>
> memorylocked is unlimited:
>> ->ulimit -a
>> core file size (blocks, -c) 0
>> data seg size (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size (blocks, -f) unlimited
>> pending signals (-i) 515028
>> max locked memory (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 1024
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority (-r) 0
>> stack size (kbytes, -s) 10240
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) 1024
>> virtual memory (kbytes, -v) unlimited
>> file locks (-x) unlimited
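
One caveat on the limits: what matters is the limit in effect for the processes mpirun actually launches, which can differ from an interactive shell when jobs go through SGE or a daemon. A quick check, run the same way the real job is started:

    mpirun -n 1 bash -c 'ulimit -l'

should also print "unlimited".
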
> IB devices are present:
>> ->ibv_devinfo
>> hca_id: mlx4_0
>>     transport:           InfiniBand (0)
>>     fw_ver:              2.9.1000
>>     node_guid:           0002:c903:004d:b0e2
>>     sys_image_guid:      0002:c903:004d:b0e5
>>     vendor_id:           0x02c9
>>     vendor_part_id:      26428
>>     hw_ver:              0xB0
>>     board_id:            MT_0D90110009
>>     phys_port_cnt:       1
>>         port: 1
>>             state:       PORT_ACTIVE (4)
>>             max_mtu:     4096 (5)
>>             active_mtu:  4096 (5)
>>             sm_lid:      2
>>             port_lid:    53
>>             port_lmc:    0x00
>>             link_layer:  InfiniBand
>
> the hello program source:
>> ->cat hello.c
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[]) {
>>     int numprocs, rank, namelen;
>>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Get_processor_name(processor_name, &namelen);
>>
>>     printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
> simply compiled as:
>> mpicc hello.c -o hello
>
> the IB modules seem to be present:
>> ->service openibd status
>>
>> HCA driver loaded
>>
>> Configured IPoIB devices:
>> ib0
>>
>> Currently active IPoIB devices:
>> ib0
>>
>> The following OFED modules are loaded:
>>
>> rdma_ucm
>> rdma_cm
>> ib_addr
>> ib_ipoib
>> mlx4_core
>> mlx4_ib
>> mlx4_en
>> ib_mthca
>> ib_uverbs
>> ib_umad
>> ib_sa
>> ib_cm
>> ib_mad
>> ib_core
>> iw_cxgb3
>> iw_cxgb4
>> iw_nes
>> ib_qib
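
It may also be worth confirming what the Open MPI build itself installed. A quick check, assuming ompi_info from the same 1.6.4 prefix is first in PATH:

    ompi_info | grep "MCA btl"

With the build options above this should show openib and udapl (among others) in the btl component list; once udapl is excluded or dropped from the build, the openib BTL is what should carry the IB traffic.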
>
> Can anyone help?
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users