Open MPI User's Mailing List Archives

Subject: [OMPI users] Mellanox MLX4_EVENT_TYPE_SRQ_LIMIT kernel messages
From: Mark Dixon (m.c.dixon_at_[hidden])
Date: 2012-09-28 11:38:19


Hi,

We've been putting a new Mellanox QDR Intel Sandy Bridge cluster, based on
CentOS 6.3, through its paces and we're getting repeated kernel messages
we never used to get on CentOS 5. An example on one node:

Sep 28 09:58:20 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:27 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:27 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:29 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:29 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:31 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:31 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:32 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:45 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:45 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 10:08:23 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT

These messages appeared when running IMB compiled with Open MPI 1.6.1
across 256 cores (16 nodes, 16 cores per node). The job ran from
09:56:54 to 10:08:46 and failed without any obvious error message.
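
To line the messages up against the job window, the log can be tallied
per node and per minute. This is only a minimal sketch, assuming the
syslog layout shown above; the function name and the regular expression
are illustrative and not part of any Mellanox or Open MPI tool:

```python
import re
from collections import Counter

# Assumes the syslog layout shown above:
# "Mon DD HH:MM:SS host kernel: ... MLX4_EVENT_TYPE_SRQ_LIMIT"
SRQ_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} (\S+) kernel: .*MLX4_EVENT_TYPE_SRQ_LIMIT"
)

def srq_limit_events_per_minute(lines):
    """Count SRQ_LIMIT kernel messages, keyed by (host, 'Mon DD HH:MM')."""
    counts = Counter()
    for line in lines:
        m = SRQ_RE.match(line)
        if m:
            counts[(m.group(2), m.group(1))] += 1
    return counts

# Two of the lines quoted above, plus one unrelated line that is ignored.
sample = [
    "Sep 28 09:58:20 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT",
    "Sep 28 09:58:27 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT",
    "Sep 28 09:59:01 g8s1n2 kernel: some unrelated message",
]
for (host, minute), n in sorted(srq_limit_events_per_minute(sample).items()):
    print(host, minute, n)
```

For the excerpt above, this kind of tally puts ten of the eleven messages
in the 09:58 minute and the remaining one at 10:08, all inside the job's
run time.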

Now, I'm used to IMB running into trouble at larger core counts, but I'm
wondering whether anyone has seen these messages before and knows whether
they indicate a problem.

We're running with an increased log_num_mtt option for mlx4_core, as
recommended by the Open MPI FAQ, and have also raised log_num_srq to its
maximum value in a failed attempt to get rid of the messages:

$ cat /etc/modprobe.d/libmlx4_local.conf
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3 log_num_srq=20
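
As a rough sanity check of what these MTT settings allow, here is a small
sketch. It assumes the formula given in the Open MPI FAQ (maximum
registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size) and
a 4 KiB page size; the function name is made up for illustration, and
log_num_srq does not enter into this calculation:

```python
import re

def max_registerable_bytes(options_line, page_size=4096):
    """Estimate registerable memory from an mlx4_core 'options' line.

    Assumed formula (as given in the Open MPI FAQ):
        2**log_num_mtt * 2**log_mtts_per_seg * page_size
    page_size defaults to 4096 bytes, which is an assumption here.
    """
    # Collect every name=number pair on the line into a dict.
    params = dict(re.findall(r"(\w+)=(\d+)", options_line))
    log_num_mtt = int(params["log_num_mtt"])
    log_mtts_per_seg = int(params["log_mtts_per_seg"])
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size

line = "options mlx4_core log_num_mtt=24 log_mtts_per_seg=3 log_num_srq=20"
print(max_registerable_bytes(line) / 2**30, "GiB")
```

Under those assumptions the settings above work out to 2^39 bytes, i.e.
512 GiB of registerable memory per node.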

Any thoughts?

Thanks,

Mark

-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon_at_[hidden]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------