
Subject: Re: [OMPI users] btl_openib_connect_oob.c:459:qp_create_one: error creating qp
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-07-01 08:01:03


Random thought: would it be easy for the output of cat /dev/knem to
indicate whether IOAT hardware is present?
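
(For reference, a rough way to check from the Linux side whether an I/OAT
DMA engine is present and claimed by the kernel, independent of knem; this
assumes the stock ioatdma driver and the standard dmaengine sysfs layout:

$> lsmod | grep ioatdma
$> ls /sys/class/dma/

If ioatdma is loaded and /sys/class/dma/ lists DMA channels, the hardware
is there.)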

On Jul 1, 2009, at 5:23 AM, Jose Gracia wrote:

> Dear all,
>
> I have problems running large jobs on a PC cluster with Open MPI v1.3.
> Typically the error appears only for process counts >= 2048 (actually
> cores), sometimes also below.
>
> The nodes (Intel Nehalem, 2 processors with 4 cores each) run
> (Scientific?) Linux.
> $> uname -a
> Linux cl3fr1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 12:48:13 EDT 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> The code starts normally, reads its input data sets (~4GB), does some
> initialization and then continues with the actual calculations. Some
> time after that, it fails with the following error message:
>
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one]
> error creating qp errno says Cannot allocate memory
>
> Memory usage by the application should not be the problem. At this
> process count, the code uses only ~100MB per process. Also, the code
> runs fine at lower process counts, where it consumes more memory.
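
For what it's worth, with the openib BTL an "error creating qp ... Cannot
allocate memory" usually points at InfiniBand queue-pair resources rather
than the application heap: each rank creates queue pairs (plus their
buffers) for every peer it connects to, so at 2048 ranks the aggregate can
exhaust HCA/driver limits even though each rank only uses ~100MB. A rough
way to see how the receive queues are configured on a given install
(parameter name taken from the 1.3-era openib BTL; the exact default
string may differ here):

$> ompi_info --param btl openib | grep receive_queues

Shared receive queues ("S" entries) generally need far less memory per
peer than per-peer ("P") queues.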
>
>
> I also get the following, apparently secondary, error messages:
>
> [n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv:
> readv
> failed: Connection reset by peer (104)
>
>
> The cluster uses an InfiniBand interconnect. I am aware of only the
> following system-wide parameter changes:
> btl_openib_ib_min_rnr_timer = 25
> btl_openib_ib_timeout = 20
>
> $> ulimit -l
> unlimited
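
(For completeness: system-wide MCA settings like the two above normally
live in the installation's etc/openmpi-mca-params.conf, or per user in
~/.openmpi/mca-params.conf, as plain "name = value" lines. The path below
is guessed from the MPIDIR shown in the module output further down and may
differ on this system:

$> cat /opt/mpi/openmpi/1.3-intel-11.0/etc/openmpi-mca-params.conf

Checking that file would show whether anything beyond
btl_openib_ib_min_rnr_timer and btl_openib_ib_timeout has been changed.)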
>
>
> I attached:
> 1) $> ompi_info --all > ompi_info.log
> 2) stderr from the PBS job: stderr.log
>
>
> Thanks for any help you may give!
>
> Cheers,
> Jose
>
>
> <ompi_info.log.gz>
> + export OMP_NUM_THREADS=1
> + OMP_NUM_THREADS=1
> + module load compiler/intel mpi/openmpi/1.3-intel-11.0
> ++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash load
> compiler/intel mpi/openmpi/1.3-intel-11.0
> + eval LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/
> local/lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/
> compiler/intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/
> 11.0.074/lib/intel64 ';export' 'LD_LIBRARY_PATH;LOADEDMODULES=system/
> maui/3.2.6p21:compiler/intel/11.0:mpi/openmpi/1.3-intel-11.0'
> ';export' 'LOADEDMODULES;MANPATH=/usr/local/man::/opt/system/modules/
> default/man:/opt/compiler/intel//cc/11.0.074/man:/opt/compiler/
> intel//fc/11.0.074/man:/opt/mpi/openmpi/1.3-intel-11.0/man'
> ';export' 'MANPATH;MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0' ';export'
> 'MPIDIR;MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin' ';export'
> 'MPI_BIN_DIR;MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include'
> ';export' 'MPI_INC_DIR;MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/
> lib' ';export' 'MPI_LIB_DIR;MPI_MAN_DIR=/opt/mpi/openmpi/1.3-
> intel-11.0/man' ';export' 'MPI_MAN_DIR;MPI_VERSION=1.3-intel-11.0'
> ';export' 'MPI_VERSION;NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/
> intel64/locale/%l_%t/%N' ';export' 'NLSPATH;PATH=/opt/mpi/openmpi/
> 1.3-intel-11.0/bin:/opt/compiler/intel//fc/11.0.074/bin/intel64:/opt/
> compiler/intel//java/jre1.6.0_14/bin:/opt/compiler/intel//cc/
> 11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/hpcjgrac/bin:/usr/local/
> bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/3.2.6p21/bin:/usr/
> kerberos/bin:/bin:/usr/bin' ';export' 'PATH;_LMFILES_=/opt/system/
> modulefiles/system/maui/3.2.6p21:/opt/modulefiles/compiler/intel/
> 11.0:/opt/modulefiles/mpi/openmpi/1.3-intel-11.0' ';export'
> '_LMFILES_;'
> ++ LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/local/
> lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/compiler/
> intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/11.0.074/lib/
> intel64
> ++ export LD_LIBRARY_PATH
> ++ LOADEDMODULES=system/maui/3.2.6p21:compiler/intel/11.0:mpi/
> openmpi/1.3-intel-11.0
> ++ export LOADEDMODULES
> ++ MANPATH=/usr/local/man::/opt/system/modules/default/man:/opt/
> compiler/intel//cc/11.0.074/man:/opt/compiler/intel//fc/11.0.074/
> man:/opt/mpi/openmpi/1.3-intel-11.0/man
> ++ export MANPATH
> ++ MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0
> ++ export MPIDIR
> ++ MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin
> ++ export MPI_BIN_DIR
> ++ MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include
> ++ export MPI_INC_DIR
> ++ MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/lib
> ++ export MPI_LIB_DIR
> ++ MPI_MAN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/man
> ++ export MPI_MAN_DIR
> ++ MPI_VERSION=1.3-intel-11.0
> ++ export MPI_VERSION
> ++ NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/intel64/locale/%l_%t/
> %N
> ++ export NLSPATH
> ++ PATH=/opt/mpi/openmpi/1.3-intel-11.0/bin:/opt/compiler/intel//fc/
> 11.0.074/bin/intel64:/opt/compiler/intel//java/jre1.6.0_14/bin:/opt/
> compiler/intel//cc/11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/
> hpcjgrac/bin:/usr/local/bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/
> 3.2.6p21/bin:/usr/kerberos/bin:/bin:/usr/bin
> ++ export PATH
> ++ _LMFILES_=/opt/system/modulefiles/system/maui/3.2.6p21:/opt/
> modulefiles/compiler/intel/11.0:/opt/modulefiles/mpi/openmpi/1.3-
> intel-11.0
> ++ export _LMFILES_
> + module list
> ++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash list
> Currently Loaded Modulefiles:
> 1) system/maui/3.2.6p21 3) mpi/openmpi/1.3-intel-11.0
> 2) compiler/intel/11.0
> + eval
> + cd /nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/benchmark/
> applications/gadget/tmp/GADGET_NEHALEM-
> HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01
> ++ date
> + echo '<jobstart at="Fri Jun 19 09:50:05 CEST 2009" />'
> + mpiexec time /nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/
> benchmark/applications/gadget/tmp/GADGET_NEHALEM-
> HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01/GADGET_NEHALEM-
> HLRS_cname_NEHALEM-HLRS.exe param.txt
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],7] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],6] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],5] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],1] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],2] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n100501:14587] [[40339,0],0]-[[40339,1],3] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033201][[40339,1],1547][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201:3588] *** An error occurred in MPI_Sendrecv
> [n033201:3588] *** on communicator MPI_COMM_WORLD
> [n033201:3588] *** MPI_ERR_OTHER: known error not in list
> [n033201:3588] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [n033102][[40339,1],1538][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033102][[40339,1],1543][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1549][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033102][[40339,1],1541][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033102][[40339,1],1536][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1544][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033202][[40339,1],1553][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033202][[40339,1],1556][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply
> start connect
> [n033202][[40339,1],1558][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1559][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033202][[40339,1],1557][../../../../../ompi/mca/btl/openib/connect/
> btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno
> says Cannot allocate memory
> [n033201:03576] [[40339,0],193]-[[40339,1],1544]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033102:03498] [[40339,0],192]-[[40339,1],1538]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033102:03498] [[40339,0],192]-[[40339,1],1543]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033201:03576] [[40339,0],193]-[[40339,1],1551]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033102:03498] [[40339,0],192]-[[40339,1],1540]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033201:03576] [[40339,0],193]-[[40339,1],1549]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033202:03719] [[40339,0],194]-[[40339,1],1555]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [n033202:03719] [[40339,0],194]-[[40339,1],1552]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> Command exited with non-zero status 16
> 64.36user 3.48system 1:20.39elapsed 84%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (7major+125286minor)pagefaults 0swaps
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1538 with PID 3501 on
> node n033102 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [n100501:14587] 11 more processes have sent help message help-mpi-
> errors.txt / mpi_errors_are_fatal
> [n100501:14587] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> ++ date
> + echo '<jobend at="Fri Jun 19 09:51:27 CEST 2009" />'
> <ATT3807088.txt>

-- 
Jeff Squyres
Cisco Systems