On Wed, Jan 17, 2007 at 04:12:10AM -0500, Robin Humble wrote:
> so this isn't really an OpenMPI question (I don't think), but you guys
> will have hit the problem if anyone has...
> basically I'm seeing wildly different bandwidths over InfiniBand 4x DDR
> when I use different kernels.
> I'm testing with netpipe-3.6.2's NPmpi, but a home-grown pingpong sees
> the same thing.
> the default 2.6.9-42.0.3.ELsmp (and also sles10's kernel) gives ok
> bandwidth (50% of peak I guess is good?) at ~10 Gbit/s, but a pile of
> newer kernels (18.104.22.168, 2.6.20-rc4, 2.6.18-1.2732.4.2.el5.OFED_1_1(*))
> all max out at ~5.3 Gbit/s.
> half the bandwidth! :-(
> latency is the same.
Try loading ib_mthca with the tune_pci=1 option on those kernels that are
slow.
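
A rough sketch of how that could be done, as root. It assumes nothing is
holding the module open at the time, that your ib_mthca build actually has
the tune_pci parameter (the first command lists the parameters, so you can
check), and that your distro reads /etc/modprobe.conf, which may not be the
case everywhere:

```shell
# List the parameters this ib_mthca build accepts; tune_pci should appear.
modinfo -p ib_mthca

# Unload and reload the driver with PCI tuning enabled.
# The unload fails if the module is still in use, so stop IB users first.
modprobe -r ib_mthca
modprobe ib_mthca tune_pci=1

# Optional: keep the setting across reboots.
# The config file path is an assumption and varies between distros.
echo "options ib_mthca tune_pci=1" >> /etc/modprobe.conf
```

Then rerun NPmpi on the slow kernels and compare against the 2.6.9 numbers.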
> the same OpenMPI (1.1.1 from OSCAR, rebuilt for openib support) and
> NPmpi was used with all kernels.
> I see an intermediate bandwidth if one kernel is the 'fast' 2.6.9 and
> the other is a 'slow' one, so they don't appear to be using completely
> different protocols.
> it doesn't make any difference if I try to make extra-sure it's using
> openib with:
> mpirun --mca btl openib --mca btl_tcp_if_exclude lo,eth0 ...
> OS is CentOS 4.4 x86_64 which AFAICT includes packages based on OFED 1.0.
> lspci says the PCIe card is:
> InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
> and dmesg says that all kernels are using
> ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
> but also whinges that 'HCA FW version 1.0.700 is old'.
> any ideas?
> very odd that all new kernels (including for RHEL5) are slow.
> will OFED 1.1 make any difference? it didn't build cleanly when I
> tried, but I can try and try again...
> thanks for any hints.
> (*) rhel5 + OFED 1.1 test kernel, rebuilt for centos4.4 from src.rpm at