
Open MPI Development Mailing List Archives


From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2007-07-17 03:14:44


Hi,
Try increasing the IB timeout parameter: --mca btl_mvapi_ib_timeout 14
If 14 does not work, try increasing it a little more (e.g. 16).
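
For example, for a 64-process run like the one described below (the
executable name here is only a placeholder):

  mpirun -np 64 --mca btl_mvapi_ib_timeout 16 ./transpose_app

This changes only the local ACK timeout used by the mvapi BTL; the retry
count (btl_mvapi_ib_retry_count) already defaults to its maximum of 7.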

Thanks,
Pasha

Neil Ludban wrote:
> Hi,
>
> I'm getting the errors below when calling MPI_Alltoallv() as part of
> a matrix transpose operation. It's 100% repeatable when testing with
> 16M matrix elements divided among 64 processes on 32 dual-core nodes.
> There are never any errors with fewer processes or elements, including
> the same 32 nodes with only one process per node. If anyone wants
> any additional information or has suggestions to try, please let me
> know. Otherwise, I'll have the system rebooted and hope the problem
> goes away.
>
> -Neil
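
A minimal sketch (not Neil's actual code) of the kind of MPI_Alltoallv
exchange being described, assuming each rank sends one equal-sized block
of doubles to every peer, as in the block-exchange step of a distributed
transpose; the buffer layout, datatype and split are hypothetical:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 16M elements over 64 ranks works out to about 4K doubles per peer
     * (hypothetical split; the real decomposition may differ). */
    const int block = (16 * 1024 * 1024) / (nprocs * nprocs);

    double *sendbuf = malloc((size_t)block * nprocs * sizeof(double));
    double *recvbuf = malloc((size_t)block * nprocs * sizeof(double));
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));

    for (int p = 0; p < nprocs; p++) {
        counts[p] = block;        /* same count to/from every peer   */
        displs[p] = p * block;    /* contiguous blocks in rank order */
    }
    for (int i = 0; i < block * nprocs; i++)
        sendbuf[i] = rank;        /* dummy payload */

    /* The all-to-all block exchange (the MPI_Alltoallv call in question). */
    MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE,
                  recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}

With 64 ranks this is roughly 32 KB sent to each of 63 peers per call, a
heavily congested fan-out, which is the situation the timeout advice above
is aimed at.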
>
>
>
> [0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> from c065 to: c077 error polling HP CQ with status
> VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab6590200
> [0,1,3][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> from c069 to: c078 error polling HP CQ with status
> VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab61f6380
> --------------------------------------------------------------------------
> The retry count is a down counter initialized on creation of the QP. Retry
> count is defined in the InfiniBand Spec 1.2 (12.7.38):
> The total number of times that the sender wishes the receiver to retry
> timeout, packet sequence, etc. errors before posting a completion error.
>
> Note that two mca parameters are involved here:
> btl_mvapi_ib_retry_count - The number of times the sender will attempt to
> retry (defaulted to 7, the maximum value).
>
> btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10). The
> actual timeout value used is calculated as:
> (4.096 micro-seconds * 2^btl_mvapi_ib_timeout).
> See InfiniBand Spec 1.2 (12.7.34) for more details.
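
To put the suggested values in concrete terms, working from the formula
quoted above:

  btl_mvapi_ib_timeout = 10 (default):  4.096 us * 2^10  ~=   4.2 ms per retry
  btl_mvapi_ib_timeout = 14:            4.096 us * 2^14  ~=  67.1 ms per retry
  btl_mvapi_ib_timeout = 16:            4.096 us * 2^16  ~= 268.4 ms per retry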
>
> What to do next:
> One item to note is the hosts on which this error has occurred; it has been
> observed that rebooting or removing a particular host from the job can resolve
> this issue. Should you be able to identify a specific cause or additional
> troubleshooting information, please report it to devel_at_open-mpi.org.
>
>
> % ompi_info
> Open MPI: 1.2.3
> Open MPI SVN revision: r15136
> Open RTE: 1.2.3
> Open RTE SVN revision: r15136
> OPAL: 1.2.3
> OPAL SVN revision: r15136
> Prefix: /home/nludban/ParaM-kodos-openmpi-ib-openmpi123
> Configured architecture: x86_64-unknown-linux-gnu
> Configured by: nludban
> Configured on: Mon Jul 16 11:18:27 EDT 2007
> Configure host: kodos
> Built by: nludban
> Built on: Mon Jul 16 11:27:04 EDT 2007
> Built host: kodos
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /usr/bin/gfortran
> Fortran90 compiler: gfortran
> Fortran90 compiler abs: /usr/bin/gfortran
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: yes
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: yes
> mpirun default --prefix: no
> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.3)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.3)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
> MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.3)
> MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.3)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.3)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.2.3)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.3)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.3)
> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.3)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.3)
> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.3)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.3)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.3)
> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.3)
> MCA btl: mvapi (MCA v1.0, API v1.0.1, Component v1.2.3)
> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.3)
> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.3)
> MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.3)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.3)
> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.3)
> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.3)
> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.3)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.3)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.3)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.3)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.3)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.3)
> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.3)
> MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.3)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.3)
> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.3)
> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.3)
> MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.3)
> MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.3)
> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.3)
> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.3)
> MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.3)
> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.3)
> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.3)
> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.3)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.3)
> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.3)
> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.3)
> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.3)
> MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.3)
> MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.3)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.2.3)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.3)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.3)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.3)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.3)
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>