Subject: [OMPI devel] Very poor performance with btl sm on twin nehalem servers with Mellanox ConnectX installed
From: Oskar Enoksson (enok_at_[hidden])
Date: 2010-05-13 06:56:33


Sorry for cross-posting; I already posted this report to the users list,
but the developers list is probably more relevant.

I have a cluster with two Intel Xeon Nehalem E5520 CPUs per server
(quad-core, 2.27GHz). The interconnect is 4x QDR InfiniBand (Mellanox
ConnectX).

I have compiled and installed Open MPI 1.4.2. It was configured with
"--with-libnuma --with-sge", using gcc 4.4 and "-march=native -O3". The
kernel is 2.6.32.12, which I compiled myself. The system is CentOS 5.4.
I use gridengine 6.2u5. The installed OFED stack is 1.5.1.
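
For reference, the build went roughly like this (from memory, so the
exact invocation is approximate; the install prefix and make
parallelism are just examples):

  export CFLAGS="-march=native -O3"
  ./configure --with-libnuma --with-sge \
              --prefix=/opt/openmpi-1.4.2   # prefix is just an example
  make -j8 && make install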

The problem is that I get very bad performance unless I explicitly
exclude the "sm" btl, and I can't figure out why. I have searched the
web and the Open MPI mailing lists; I have seen reports of non-optimal
sm performance, but my results are far worse than anything I have
found.

I run the "mpi_stress" program with different packet lengths on a
single server using 8 slots, so that all eight cores of one server are
occupied, just to see the loopback/shared-memory performance.
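
The jobs are submitted through gridengine, but the equivalent
interactive launch is essentially the following (a sketch; I leave out
the mpi_stress message-size arguments, which varied per run):

  mpirun -np 8 -mca btl self,openib ./mpi_stress   # fast case
  mpirun -np 8 -mca btl self,sm     ./mpi_stress   # slow case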

When I use "-mca btl self,openib" I get pretty good results, between
450MB/s and 700MB/s depending on the packet length. When I use "-mca
btl self,sm" or "-mca btl self,sm,openib" I get just 9MB/s for 1MB
packets and 1.5MB/s for 10kB packets. Following the FAQ I have tried
tweaking btl_sm_num_fifos=8 and btl_sm_eager_limit=65536, which improves
things to 30MB/s for 1MB packets and 5MB/s for 10kB packets. With
"-mca mpi_paffinity_alone 1" I gain another 20% speedup.

But this is still pretty lousy; I had expected several GB/s. What is
going on? Any hints? I thought these CPUs had excellent shared-memory
bandwidth over QuickPath.

Hyperthreading is enabled, if that is relevant. The locked-memory limit
is 500MB and the stack limit is 64MB.
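
For reference, I check those limits with ulimit, which reports
kilobytes (500 MB = 512000 kB locked, 64 MB = 65536 kB stack):

  ulimit -l   # max locked memory, kB
  ulimit -s   # stack size, kB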

Please help!
Thanks
/Oskar