Open MPI Development Mailing List Archives

From: Neil Ludban (nludban_at_[hidden])
Date: 2007-07-16 18:51:25


Hi,

I'm getting the errors below when calling MPI_Alltoallv() as part of
a matrix transpose operation. It's 100% repeatable when testing with
16M matrix elements divided between 64 processes on 32 dual core nodes.
There are never any errors with fewer processes or elements, including
the same 32 nodes with only one process per node. If anyone wants
any additional information or has suggestions to try, please let me
know. Otherwise, I'll have the system rebooted and hope the problem
goes away.

-Neil

[0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress] from c065
        to: c077 error polling HP CQ with status VAPI_RETRY_EXC_ERR status
        number 12 for Frag : 0x2ab6590200
[0,1,3][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress] from c069
        to: c078 error polling HP CQ with status VAPI_RETRY_EXC_ERR status
        number 12 for Frag : 0x2ab61f6380
--------------------------------------------------------------------------
The retry count is a down counter initialized on creation of the QP. Retry
count is defined in the InfiniBand Spec 1.2 (12.7.38):
The total number of times that the sender wishes the receiver to retry
timeout, packet sequence, etc. errors before posting a completion error.

Note that two mca parameters are involved here:
btl_mvapi_ib_retry_count - The number of times the sender will attempt to
retry (defaulted to 7, the maximum value).

btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10). The
actual timeout value used is calculated as:
(4.096 micro-seconds * 2^btl_mvapi_ib_timeout).
See InfiniBand Spec 1.2 (12.7.34) for more details.

What to do next:
One item to note is the hosts on which this error has occurred; it has been
observed that rebooting or removing a particular host from the job can resolve
this issue. Should you be able to identify a specific cause or additional
troubleshooting information, please report this to devel_at_open-mpi.org.
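
For reference, the timeout formula quoted in the help text above can be
worked through numerically. The snippet below is only a sketch of that one
formula; the function name local_ack_timeout_us is made up for illustration
and is not part of Open MPI. With the default btl_mvapi_ib_timeout of 10 it
gives 4.096 * 2^10 = 4194.304 microseconds, i.e. roughly 4.2 ms per timeout.

```python
# Sketch of the formula from the help text:
#   timeout = 4.096 microseconds * 2^btl_mvapi_ib_timeout
# The function name is hypothetical (not an Open MPI API).

def local_ack_timeout_us(ib_timeout: int) -> float:
    """Return the local ack timeout in microseconds for a given exponent."""
    if ib_timeout < 0:
        raise ValueError("ib_timeout must be non-negative")
    return 4.096 * (2 ** ib_timeout)


if __name__ == "__main__":
    # 10 is the default named in the help text; the other values are
    # arbitrary examples showing how quickly the timeout grows.
    for exponent in (10, 14, 18):
        micros = local_ack_timeout_us(exponent)
        print(f"btl_mvapi_ib_timeout={exponent}: "
              f"{micros:.3f} us ({micros / 1000.0:.3f} ms)")
```

Each increment of the parameter doubles the timeout, so small changes to
btl_mvapi_ib_timeout have a large effect on how long a sender waits before
counting a retry.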

% ompi_info
                Open MPI: 1.2.3
   Open MPI SVN revision: r15136
                Open RTE: 1.2.3
   Open RTE SVN revision: r15136
                    OPAL: 1.2.3
       OPAL SVN revision: r15136
                  Prefix: /home/nludban/ParaM-kodos-openmpi-ib-openmpi123
 Configured architecture: x86_64-unknown-linux-gnu
           Configured by: nludban
           Configured on: Mon Jul 16 11:18:27 EDT 2007
          Configure host: kodos
                Built by: nludban
                Built on: Mon Jul 16 11:27:04 EDT 2007
              Built host: kodos
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.3)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.3)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.3)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.3)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.3)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.2.3)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.3)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.3)
               MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.3)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.3)
              MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA btl: mvapi (MCA v1.0, API v1.0.1, Component v1.2.3)
                 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.3)
                 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.3)
                 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.3)
              MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.3)
              MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.3)
              MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.3)
                  MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.3)
                  MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.3)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.3)
               MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.3)
                MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.3)
                MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.3)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.3)