Sorry for crossposting. I already posted this report to the users list,
but the developers list is probably more relevant.
I have a cluster with two Intel Xeon Nehalem E5520 CPUs per server
(quad-core, 2.27GHz). The interconnect is 4xQDR Infiniband (Mellanox
I have compiled and installed OpenMPI 1.4.2. OpenMPI was compiled with
"--with-libnuma --with-sge" using gcc 4.4 and "-march=native -O3". The
kernel is 126.96.36.199 and I have compiled the kernel myself. The system is
CentOS 5.4. I use Grid Engine 6.2u5. The OFED stack installed is 1.5.1.
The problem is that I get very bad performance unless I explicitly
exclude the "sm" btl and I can't figure out why. I have tried searching
the web and the OpenMPI mailing lists. I have seen reports about
non-optimal performance, but my results are far worse than any other
reports I have found.
I run the "mpi_stress" program with different packet lengths. I run on a
single server using 8 slots so that all eight cores on one server are
occupied, just to see the loopback/shared memory performance.
When I use "-mca btl self,openib" I get pretty good results, between
450MB/s and 700MB/s depending on the packet lengths. When I use "-mca
btl self,sm" or "-mca btl self,sm,openib" I just get 9MB/s for 1MB
packets and 1.5MB/s for 10kB packets. Following the FAQ I have tried
tweaking btl_sm_num_fifos=8 and btl_sm_eager_limit=65536 which improves
things to 30MB/s for 1MB packets and 5MB/s for 10kB packets. With
"-mca mpi_paffinity_alone 1" I gain another 20% speedup.
But still this is pretty lousy. I had expected several GB/s. What is
going on? Any hints? I thought these CPUs had excellent shared-memory bandwidth.
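
For concreteness, the BTL variants described above can be written out as
mpirun command lines. The Python sketch below only builds and prints them,
it does not run anything. The "./mpi_stress" path and "-np 8" are
placeholders and not the exact invocation used here; the affinity
parameter is assumed to be mpi_paffinity_alone.

```python
import shlex

# Placeholders (assumptions): benchmark path and rank count.
BENCHMARK = "./mpi_stress"
NUM_RANKS = 8


def mpirun_command(btl, extra_params=None):
    """Build an mpirun argument list for one BTL selection."""
    cmd = ["mpirun", "-np", str(NUM_RANKS), "-mca", "btl", btl]
    # Each additional MCA parameter is passed as "-mca <name> <value>".
    for name, value in (extra_params or {}).items():
        cmd += ["-mca", name, str(value)]
    cmd.append(BENCHMARK)
    return cmd


# Tuning values taken from the text; the affinity parameter name is assumed.
SM_TUNING = {"btl_sm_num_fifos": 8, "btl_sm_eager_limit": 65536}
SM_TUNING_AFFINITY = dict(SM_TUNING, mpi_paffinity_alone=1)

VARIANTS = {
    "openib only": mpirun_command("self,openib"),
    "sm only": mpirun_command("self,sm"),
    "sm + openib": mpirun_command("self,sm,openib"),
    "sm tuned": mpirun_command("self,sm", SM_TUNING),
    "sm tuned + affinity": mpirun_command("self,sm", SM_TUNING_AFFINITY),
}

if __name__ == "__main__":
    for label, cmd in VARIANTS.items():
        print(f"{label}: {shlex.join(cmd)}")
```

Only the "btl" selection and the extra "-mca" pairs differ between the
variants, so any difference in throughput should come from those settings.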
Hyperthreading is enabled, if that is relevant. The locked-memory limit
is 500MB and the stack limit is 64MB.
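
To put the figures quoted above side by side, here is a small arithmetic
sketch in Python. It assumes the 20% affinity gain applies equally to both
packet sizes, and it compares against the lower end of the openib range
(450MB/s), so the resulting factors are lower bounds on the gap.

```python
# Throughput figures quoted in this report, in MB/s.
OPENIB_RANGE = (450.0, 700.0)
SM_DEFAULT = {"1MB": 9.0, "10kB": 1.5}
SM_TUNED = {"1MB": 30.0, "10kB": 5.0}

# Assumption: the 20% speedup from processor affinity applies to both sizes.
AFFINITY_GAIN = 1.20


def with_affinity(tuned):
    """Apply the affinity gain to each tuned sm figure."""
    return {size: rate * AFFINITY_GAIN for size, rate in tuned.items()}


def shortfall(sm_rates, reference):
    """Return how many times slower each sm figure is than a reference rate."""
    return {size: reference / rate for size, rate in sm_rates.items()}


if __name__ == "__main__":
    best_sm = with_affinity(SM_TUNED)
    gap = shortfall(best_sm, OPENIB_RANGE[0])
    for size in best_sm:
        print(f"{size}: about {best_sm[size]:.1f} MB/s with sm, "
              f"at least {gap[size]:.1f}x below openib")
```

Even with all the tuning applied, the sm runs stay more than an order of
magnitude below the slowest openib result on the same node.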