We are trying to run some tests on a new cluster and are having a
problem telling hardware, system software, and OMPI failures apart.
This is a 16-ppn Opteron system running SLURM under RHEL (I forget the
precise version), with IB and OMPI 1.2.6.
Everything launches just fine and seems to work okay. However, on
large jobs (e.g., >450 procs), the IMB tests fail and crash a bunch of
the nodes on which they are running.
Has anyone else been able to test in 16+ ppn configurations? I'm
wondering if we have an SM (shared memory) problem - perhaps inadequate
backing file space or something?
Any suggestions on how to debug this or config options for higher ppn
systems would be appreciated. We don't see this problem on anything
with a lower ppn. I'm going to give it a try with 1.3 and see what