Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI_Alltoall problem: error creating qp
From: Shao-Ching Huang (pittzml_at_[hidden])
Date: 2009-08-21 20:02:43


Hi

We get the following error messages when running MPI_Alltoall on
170 nodes with slots=8 on each node (i.e. 170*8=1360 MPI processes
in total):

 $ mpiexec -n 1360 -hostfile ./mach.8 ./a.out
...
[n2154][[20427,1],1180][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:463:qp_create_one]
error creating qp errno says Cannot allocate memory
[n2154][[20427,1],1183][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:809:rml_recv_cb]
error in endpoint reply start connect

The entire error message is attached. In this case, the alltoall
buffer of each process is approximately 8 MB. The same program with
the same data size runs fine on 169 nodes (169*8=1352 processes) or
fewer. It also runs on 170 nodes (170*8=1360 processes) if we reduce
the alltoall buffer size below a certain value.
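To see how the job grows between the working and failing sizes, here is a rough back-of-the-envelope sketch. It assumes (not stated in the post) that the ~8 MB figure is the total send buffer per process, split evenly across all ranks, and that every rank on another node gets its own set of InfiniBand queue pairs. `QPS_PER_PEER`, `size_job` and `JobSizing` are hypothetical names for illustration only, and `QPS_PER_PEER = 4` is a placeholder, not an Open MPI default.

```python
from dataclasses import dataclass

SLOTS_PER_NODE = 8            # from the post: slots=8 on each node
SEND_BUF_BYTES = 8 * 1024**2  # assumption: "~8MB" = total send buffer per process
QPS_PER_PEER = 4              # hypothetical placeholder, NOT an Open MPI default


@dataclass
class JobSizing:
    nodes: int
    nprocs: int         # total MPI processes
    remote_peers: int   # peers per process that live on other nodes
    block_bytes: int    # bytes each process sends to each peer
    qps_per_node: int   # queue pairs one node's HCA would have to host


def size_job(nodes, slots=SLOTS_PER_NODE, buf=SEND_BUF_BYTES,
             qps_per_peer=QPS_PER_PEER):
    nprocs = nodes * slots
    remote = nprocs - slots  # on-node peers are assumed to use shared memory
    return JobSizing(nodes, nprocs, remote, buf // nprocs,
                     slots * remote * qps_per_peer)


for n in (169, 170):
    s = size_job(n)
    print(f"{s.nodes} nodes: {s.nprocs} procs, {s.remote_peers} remote peers/proc, "
          f"{s.block_bytes} bytes/peer, {s.qps_per_node} QPs/node")
```

Only the trend is meaningful here: the per-peer block shrinks slightly while the number of QPs each node must allocate keeps growing with the process count. The real per-peer QP count depends on how the openib BTL is configured; `ompi_info --param btl openib` lists the relevant settings.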

Which Open MPI parameters do we need to adjust in order to run
high-process-count (alltoall) jobs?

The outputs of uname, ibv_devinfo and ompi_info are appended below. Thanks,

Shao-Ching

 $ uname -a
Linux n2001 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009
x86_64 x86_64 x86_64 GNU/Linux

-------------------------------------------------------------------------

$ ibv_devinfo
hca_id: mthca0
        fw_ver: 5.3.0
        node_guid: 0023:7dff:ff93:1dec
        sys_image_guid: 0023:7dff:ff93:1def
        vendor_id: 0x02c9
        vendor_part_id: 25218
        hw_ver: 0x20
        board_id: HP_0090010002
        phys_port_cnt: 2
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 2
                        port_lid: 401
                        port_lmc: 0x00

-------------------------------------------------------------------------

$ ompi_info

                 Package: Open MPI sch_at_n2041 Distribution
                Open MPI: 1.3.3
   Open MPI SVN revision: r21666
   Open MPI release date: Jul 14, 2009
                Open RTE: 1.3.3
   Open RTE SVN revision: r21666
   Open RTE release date: Jul 14, 2009
                    OPAL: 1.3.3
       OPAL SVN revision: r21666
       OPAL release date: Jul 14, 2009
            Ident string: 1.3.3
                  Prefix: /u/home2/sch/local/openmpi/1.3.3-icc11
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: n2041
           Configured by: sch
           Configured on: Mon Aug 17 10:37:46 PDT 2009
          Configure host: n2041
                Built by: sch
                Built on: Mon Aug 17 10:42:20 PDT 2009
              Built host: n2041
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: icc
     C compiler absolute: /u/local/compilers/intel/11.1/046/bin/intel64/icc
            C++ compiler: icpc
   C++ compiler absolute: /u/local/compilers/intel/11.1/046/bin/intel64/icpc
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /u/local/compilers/intel/11.1/046/bin/intel64/ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /u/local/compilers/intel/11.1/046/bin/intel64/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no (checkpoint thread: no)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
              MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
           MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
                  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)