I am trying to run WRF on 1024 cores with OpenMPI 1.3.3 and
1.4. I can get the code to run with 512 cores, but it crashes
at startup on 1024 cores. I am getting the following error message:
[n172][[43536,1],0][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
[n172][[43536,1],0][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect
Based on what I found through Google, I tried changing btl_openib_receive_queues,
but none of my attempts worked. Here is my latest attempt, which was meant to
reduce the total number of queue pairs:
mpirun -np 1024 \
-mca btl_openib_receive_queues P,128,2048,128,128:S,65536,256,192,128 \
wrf.exe
These settings did not help.
Am I looking in the right place?
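As a rough sanity check on the receive_queues value above, the sketch below counts its colon-separated queue specifications and multiplies that by the number of possible peers. It assumes that each specification results in one queue pair per connected peer, and that every rank eventually connects to every other rank over openib. Neither assumption is confirmed here, so treat the result as an upper bound at best. The variable names are illustrative only.

```shell
#!/bin/sh
# Illustrative values taken from the mpirun command above.
queues="P,128,2048,128,128:S,65536,256,192,128"
nprocs=1024

# Count the colon-separated queue specifications in the string.
nqueues=$(echo "$queues" | awk -F: '{print NF}')

# ASSUMPTION: one queue pair per specification per peer, and a fully
# connected job, so each rank may talk to (nprocs - 1) peers.
qp_bound=$(( nqueues * (nprocs - 1) ))

echo "queue specifications: $nqueues"
echo "assumed upper bound on queue pairs per process: $qp_bound"
```

If that assumption holds, the number of entries in the list, rather than the buffer counts inside each entry, is what drives the total number of queue pairs.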
System setup:
Centos-5.3
Ofed-1.4.1
Intel Compiler 11.1.038
Openmpi-1.3.3 and 1.4
Build options:
./configure CC=icc CXX=icpc F77=ifort F90=ifort FC=ifort --prefix=/opt/openmpi/1.3.3-intel --without-sge --with-openib --enable-io-romio \
--with-io-romio-flags=--with-file-system=lustre --with-pic
Thanks,
Craig