Open MPI User's Mailing List Archives

Subject: [OMPI users] Unable to run WRF on large core counts (1024+), queue pair error
From: Craig Tierney (Craig.Tierney_at_[hidden])
Date: 2009-12-17 18:14:58


I am trying to run WRF on 1024 cores with OpenMPI 1.3.3 and
1.4. I can get the code to run with 512 cores, but it crashes
at startup on 1024 cores. I am getting the following error message:

[n172][[43536,1],0][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
[n172][[43536,1],0][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect

Based on what I found through Google, I tried changing the setting for
btl_openib_receive_queues, but none of my attempts worked. Here is my
latest attempt to reduce the total number of queue pairs:

mpirun -np 1024 \
    -mca btl_openib_receive_queues P,128,2048,128,128:S,65536,256,192,128 \
    wrf.exe

These settings did not help.
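
For reference, the value passed to btl_openib_receive_queues above is a colon-separated list of queue specifications, each starting with a type letter (P for per-peer, S for shared receive queue). The sketch below only counts the entries and prints a rough upper bound on queue pairs per process. It assumes that every entry results in one queue pair per connected peer and that a process eventually connects to every other rank; both are assumptions for illustration, not something confirmed in this post. The function name summarize_queues is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarize a btl_openib_receive_queues value.
# Assumption: each colon-separated entry costs one queue pair per peer.
summarize_queues() {
    spec=$1      # e.g. "P,128,2048,128,128:S,65536,256,192,128"
    nprocs=$2    # total number of MPI processes
    count=0
    old_ifs=$IFS
    IFS=:        # split the specification on colons
    for entry in $spec; do
        count=$((count + 1))
        qtype=${entry%%,*}   # first comma-separated field is the queue type
        echo "queue $count: type=$qtype spec=$entry"
    done
    IFS=$old_ifs
    echo "entries=$count"
    # Worst case: a connection to every other rank in the job.
    echo "upper_bound_qps_per_process=$((count * (nprocs - 1)))"
}

summarize_queues "P,128,2048,128,128:S,65536,256,192,128" 1024
```

Under these assumptions, the count depends on the number of entries and the number of peers, not on the buffer sizes or counts inside each entry.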

Am I looking in the right place?

System setup:
Centos-5.3
Ofed-1.4.1
Intel Compiler 11.1.038
Openmpi-1.3.3 and 1.4

Build options:

./configure CC=icc CXX=icpc F77=ifort F90=ifort FC=ifort \
    --prefix=/opt/openmpi/1.3.3-intel --without-sge --with-openib \
    --enable-io-romio --with-io-romio-flags=--with-file-system=lustre \
    --with-pic

Thanks,
Craig