Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] btl_openib_connect_oob.c:459:qp_create_one: error creating qp
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-07-01 08:01:03


Random thought: would it be easy for the output of cat /dev/knem to
indicate whether IOAT hardware is present?
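
For failures like the qp_create_one "Cannot allocate memory" errors reported below, one commonly suggested mitigation at large scale is to reduce per-peer queue-pair resource usage by moving the openib BTL toward shared receive queues (SRQ). The sketch below shows such a setting in an MCA parameter file; the queue-specification values are illustrative assumptions, not tested settings for this cluster:

```
# $HOME/.openmpi/mca-params.conf -- illustrative values, not tested settings.
# Replace most per-peer (P) receive queues with shared receive queues (S),
# so that each additional remote peer costs fewer dedicated HCA resources.
btl_openib_receive_queues = P,128,256,192,128:S,65536,256,128,32
```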

On Jul 1, 2009, at 5:23 AM, Jose Gracia wrote:

> Dear all,
>
> I have problems running large jobs on a PC cluster with Open MPI v1.3.
> Typically the error appears only for processor counts >= 2048 (actually
> cores), but sometimes also below that.
>
> The nodes (Intel Nehalem, 2 processors, 4 cores each) run (Scientific?)
> Linux.
> $> uname -a
> Linux cl3fr1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 12:48:13 EDT 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> The code starts normally, reads its input data sets (~4 GB), does some
> initialization, and then continues with the actual calculations. Some
> time after that, it fails with the following error message:
>
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one]
> error creating qp errno says Cannot allocate memory
>
> Memory usage by the application should not be the problem: at this
> process count, the code uses only ~100 MB per process. The code also
> runs at lower process counts, where it consumes more memory.
>
>
> I also get the following, apparently secondary, error messages:
>
> [n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv:
> readv
> failed: Connection reset by peer (104)
>
>
> The cluster uses InfiniBand connections. I am aware only of the
> following parameter changes (systemwide):
> btl_openib_ib_min_rnr_timer = 25
> btl_openib_ib_timeout = 20
>
> $> ulimit -l
> unlimited
>
>
> I attached:
> 1) $> ompi_info --all > ompi_info.log
> 2) stderr from the PBS: stderr.log
>
>
> Thanks for any help you may give!
>
> Cheers,
> Jose
>
>
> <ompi_info.log.gz>
> + export OMP_NUM_THREADS=1
> + OMP_NUM_THREADS=1
> + module load compiler/intel mpi/openmpi/1.3-intel-11.0
> ++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash load
> compiler/intel mpi/openmpi/1.3-intel-11.0
> + eval LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/
> local/lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/
> compiler/intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/
> 11.0.074/lib/intel64 ';export' 'LD_LIBRARY_PATH;LOADEDMODULES=system/
> maui/3.2.6p21:compiler/intel/11.0:mpi/openmpi/1.3-intel-11.0'
> ';export' 'LOADEDMODULES;MANPATH=/usr/local/man::/opt/system/modules/
> default/man:/opt/compiler/intel//cc/11.0.074/man:/opt/compiler/
> intel//fc/11.0.074/man:/opt/mpi/openmpi/1.3-intel-11.0/man'
> ';export' 'MANPATH;MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0' ';export'
> 'MPIDIR;MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin' ';export'
> 'MPI_BIN_DIR;MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include'
> ';export' 'MPI_INC_DIR;MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/
> lib' ';export' 'MPI_LIB_DIR;MPI_MAN_DIR=/opt/mpi/openmpi/1.3-
> intel-11.0/man' ';export' 'MPI_MAN_DIR;MPI_VERSION=1.3-intel-11.0'
> ';export' 'MPI_VERSION;NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/
> intel64/locale/%l_%t/%N' ';export' 'NLSPATH;PATH=/opt/mpi/openmpi/
> 1.3-intel-11.0/bin:/opt/compiler/intel//fc/11.0.074/bin/intel64:/opt/
> compiler/intel//java/jre1.6.0_14/bin:/opt/compiler/intel//cc/
> 11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/hpcjgrac/bin:/usr/local/
> bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/3.2.6p21/bin:/usr/
> kerberos/bin:/bin:/usr/bin' ';export' 'PATH;_LMFILES_=/opt/system/
> modulefiles/system/maui/3.2.6p21:/opt/modulefiles/compiler/intel/
> 11.0:/opt/modulefiles/mpi/openmpi/1.3-intel-11.0' ';export'
> '_LMFILES_;'
> ++ LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/local/
> lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/compiler/
> intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/11.0.074/lib/
> intel64
> ++ export LD_LIBRARY_PATH
> ++ LOADEDMODULES=system/maui/3.2.6p21:compiler/intel/11.0:mpi/
> openmpi/1.3-intel-11.0
> ++ export LOADEDMODULES
> ++ MANPATH=/usr/local/man::/opt/system/modules/default/man:/opt/
> compiler/intel//cc/11.0.074/man:/opt/compiler/intel//fc/11.0.074/
> man:/opt/mpi/openmpi/1.3-intel-11.0/man
> ++ export MANPATH
> ++ MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0
> ++ export MPIDIR
> ++ MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin
> ++ export MPI_BIN_DIR
> ++ MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include
> ++ export MPI_INC_DIR
> ++ MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/lib
> ++ export MPI_LIB_DIR
> ++ MPI_MAN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/man
> ++ export MPI_MAN_DIR
> ++ MPI_VERSION=1.3-intel-11.0
> ++ export MPI_VERSION
> ++ NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/intel64/locale/%l_%t/
> %N
> ++ export NLSPATH
> ++ PATH=/opt/mpi/openmpi/1.3-intel-11.0/bin:/opt/compiler/intel//fc/
> 11.0.074/bin/intel64:/opt/compiler/intel//java/jre1.6.0_14/bin:/opt/
> compiler/intel//cc/11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/
> hpcjgrac/bin:/usr/local/bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/
> 3.2.6p21/bin:/usr/kerberos/bin:/bin:/usr/bin
> ++ export PATH
> ++ _LMFILES_=/opt/system/modulefiles/system/maui/3.2.6p21:/opt/
> modulefiles/compiler/intel/11.0:/opt/modulefiles/mpi/openmpi/1.3-
> intel-11.0
> ++ export _LMFILES_
> + module list
> ++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash list
> Currently Loaded Modulefiles:
> 1) system/maui/3.2.6p21
> 2) compiler/intel/11.0
> 3) mpi/openmpi/1.3-intel-11.0
> + eval
> + cd /nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/benchmark/
> applications/gadget/tmp/GADGET_NEHALEM-
> HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01
> ++ date
> + echo '<jobstart at="Fri Jun 19 09:50:05 CEST 2009" />'
> + mpiexec time /nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/
> benchmark/applications/gadget/tmp/GADGET_NEHALEM-
> HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01/GADGET_NEHALEM-
> HLRS_cname_NEHALEM-HLRS.exe param.txt
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],7] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],6] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],5] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],1] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],2] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],3] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033201][[40339,1],1547][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201:3588] *** An error occurred in MPI_Sendrecv
> [n033201:3588] *** on communicator MPI_COMM_WORLD
> [n033201:3588] *** MPI_ERR_OTHER: known error not in list
> [n033201:3588] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [n033102][[40339,1],1538][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033102][[40339,1],1543][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1549][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033102][[40339,1],1541][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033102][[40339,1],1536][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1544][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033202][[40339,1],1553][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033202][[40339,1],1556][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033202][[40339,1],1558][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1559][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1557][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201:03576] [[40339,0],193]-[[40339,1],1544]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033102:03498] [[40339,0],192]-[[40339,1],1538]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033102:03498] [[40339,0],192]-[[40339,1],1543]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033201:03576] [[40339,0],193]-[[40339,1],1551]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033102:03498] [[40339,0],192]-[[40339,1],1540]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033201:03576] [[40339,0],193]-[[40339,1],1549]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033202:03719] [[40339,0],194]-[[40339,1],1555]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033202:03719] [[40339,0],194]-[[40339,1],1552]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> Command exited with non-zero status 16
> 64.36user 3.48system 1:20.39elapsed 84%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (7major+125286minor)pagefaults 0swaps
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1538 with PID 3501 on
> node n033102 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [n100501:14587] 11 more processes have sent help message help-mpi-
> errors.txt / mpi_errors_are_fatal
> [n100501:14587] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> ++ date
> + echo '<jobend at="Fri Jun 19 09:51:27 CEST 2009" />'
> <ATT3807088.txt>
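
As a back-of-the-envelope check: assuming Open MPI 1.3's default openib configuration uses roughly 4 QPs per remote peer (an assumption, not a figure from this thread), and that on-node peers use the sm BTL instead, the number of queue pairs a single HCA must support in a fully connected 2048-rank job is large enough to plausibly exhaust HCA/driver limits:

```shell
# Rough estimate of queue pairs (QPs) one HCA must support in a
# fully-connected 2048-rank job. qps_per_peer=4 is an assumption about
# Open MPI 1.3's default openib receive-queue settings; on-node peers
# are excluded because they communicate via the sm BTL.
ranks=2048
ranks_per_node=8                     # 2 sockets x 4 cores, per the report
qps_per_peer=4
off_node_peers=$((ranks - ranks_per_node))
qps_per_hca=$((ranks_per_node * off_node_peers * qps_per_peer))
echo "$qps_per_hca"
```

Tens of thousands of QPs on one HCA can exhaust driver or firmware resource limits, which surfaces as "Cannot allocate memory" from ibv_create_qp (here, qp_create_one) rather than as application memory pressure.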

-- 
Jeff Squyres
Cisco Systems