Subject: [OMPI users] openmpi, 1.6.3, mlx4_core, log_num_mtt and Debian/vanilla kernel
From: Stefan Friedel (stefan.friedel_at_[hidden])
Date: 2013-02-21 05:53:45

Good morning,
I'm struggling with the setup of openmpi-1.6.3 on top of Debian
wheezy/testing, specifically with Mellanox/OFED/mlx4 memory pinning.
The cluster is equipped with Mellanox MT26428 HCAs, Debian kernel
3.2.35-2 x86_64, 4x8core AMD Opteron 6212, and 128G of memory.

I'm aware of the FAQ entries about mlx4_core module parameters
(log_num_mtt etc.), but the module in Debian kernels (and in vanilla
kernels up to the recent 3.8) does not know anything about log_num_mtt.
This parameter is only available in the OFED rpms for SLES/RHEL/OEL.

Jobs started with the default environment fail. log_mtts_per_seg is a
valid parameter of mlx4_core in the Debian kernel and is set to 3;
log_num_mtt is not a valid parameter of mlx4_core and is set to 20 in
btl_openib.c. The result is the warning "...Your MPI job will continue,
but may be behave poorly and/or hang...", and a simple benchmark runs
for hours instead of returning a result after a few minutes. On the
same hardware with Debian Squeeze and openmpi-1.4.5, this job runs
flawlessly.
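
For reference, here is a rough sketch of the estimate as I understand it
from the FAQ: max registerable memory = 2^log_num_mtt *
2^log_mtts_per_seg * page size. With 20 and 3 and 4 KiB pages that gives
32 GiB, well below the 128G in these nodes. The sysfs path, the helper
names and the fallback value of 20 are my assumptions for illustration;
this is not the actual code from btl_openib.c.

```python
from pathlib import Path

# Assumed location of the module parameters; adjust if your kernel differs.
PARAM_DIR = Path("/sys/module/mlx4_core/parameters")


def read_param(name, default, param_dir=PARAM_DIR):
    """Return the integer value of a module parameter, or `default`
    if the parameter file is missing or unreadable (hypothetical helper)."""
    try:
        return int((param_dir / name).read_text().strip())
    except (OSError, ValueError):
        return default


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """FAQ formula: 2^log_num_mtt * 2^log_mtts_per_seg * page_size."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    # log_num_mtt does not exist in the Debian/vanilla module, so fall back
    # to 20, the value mentioned above as coming from btl_openib.c.
    log_num_mtt = read_param("log_num_mtt", 20)
    log_mtts_per_seg = read_param("log_mtts_per_seg", 3)
    limit = max_registerable_bytes(log_num_mtt, log_mtts_per_seg)
    print(f"log_num_mtt={log_num_mtt} log_mtts_per_seg={log_mtts_per_seg}")
    print(f"estimated registerable memory: {limit / 2**30:.1f} GiB")
```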

Is there a way to tell openmpi-1.6.3 to use the OFED module from the
vanilla kernel and not to rely on log_num_mtt for its
"do-we-have-enough-registered-mem" computation for Mellanox HCAs? Any
other ideas or hints?

Stefan Friedel

IWR * 523 * INF 368 * 69120 Heidelberg
T +49 6221 548240 * F +49 6221 545224