We have one user code that has lots of problems with RNR (receiver not
ready) errors and sometimes hangs. (The same code runs OK on another
IB-based system that has full connectivity, and on our Myrinet system.)
The IB network is oversubscribed 7:3, i.e. 7 nodes per 3 IB links up to
the main Cisco switch. In other words, we have 48 bladecenters, each with
14 blades (8 cores per blade) and its own IB switch, and 2x3 IB links per
bladecenter to the main Cisco switch.
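For anyone who wants to double-check the numbers, here is a rough Python
sketch of the topology arithmetic. The constant and function names are
made up for illustration, and it assumes that "8 cores" means per blade
and that all six uplinks per bladecenter carry traffic.

```python
from math import gcd

# Figures taken from the description above; names are illustrative only.
BLADECENTERS = 48
BLADES_PER_BLADECENTER = 14
CORES_PER_BLADE = 8              # assumption: "8 cores" is per blade
UPLINKS_PER_BLADECENTER = 2 * 3  # "2x3" IB links to the main switch


def oversubscription(nodes: int, uplinks: int) -> tuple[int, int]:
    """Reduce nodes:uplinks to lowest terms, e.g. 14:6 becomes 7:3."""
    divisor = gcd(nodes, uplinks)
    return nodes // divisor, uplinks // divisor


total_nodes = BLADECENTERS * BLADES_PER_BLADECENTER
total_cores = total_nodes * CORES_PER_BLADE
total_uplinks = BLADECENTERS * UPLINKS_PER_BLADECENTER
ratio = oversubscription(BLADES_PER_BLADECENTER, UPLINKS_PER_BLADECENTER)

print(f"nodes={total_nodes} cores={total_cores} uplinks={total_uplinks}")
print(f"oversubscription per bladecenter = {ratio[0]}:{ratio[1]}")
```

So 14 blades sharing 6 uplinks reduces to the 7:3 ratio quoted above.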
Now to the question: do you have any good suggestions for parameters
that will help us get around this problem?
I tried changing the queue-pair settings, and that does affect the
problem, but so far I haven't been able to fix it completely.
The code usually works when running with nodes=8:ppn=8, but always fails
sooner or later with nodes=16:ppn=8.
Turning off leave_pinned also helps a bit.
The best settings I have so far are:
-mca mpi_leave_pinned 0 -mca btl_openib_receive_queues
I have tried almost everything I can think of and desperately need help
here. Building everything in debug mode helps somewhat due to the code
getting so slow that the network can keep up a lot better but not
OS: CentOS5.3 (OFED 1.3.2 and 1.4.2 tested)
HW: Mellanox MT25208 InfiniHost III Ex (128MB)
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake_at_[hidden] Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se