In our quest for better shared-memory collectives, we did some runs
on a 16-core Intel machine. The SM component worked as expected, as
far as I can tell. Unfortunately, we only have one such node, so we
never tried more than 16 processes.

On Jul 24, 2008, at 11:13 PM, Ralph Castain wrote:
> Yo folks
> We are trying to run some tests on a new cluster and are having a
> problem telling hardware, system software, and OMPI failures apart.
> This is a 16-ppn Opteron system running SLURM under RHEL (forget the
> precise version), with IB and OMPI 1.2.6.
> Everything launches just fine and seems to work okay. However, on
> large jobs (e.g., >450 procs), the IMB tests fail and crash a bunch
> of the nodes on which they are running.
> Has anyone else been able to test in 16+ ppn configurations? I'm
> wondering if we have an SM problem - perhaps inadequate backing file
> space or something?
> Any suggestions on how to debug this or config options for higher
> ppn systems would be appreciated. We don't see this problem on
> anything with lesser ppn. I'm going to give it a try with 1.3 and
> see what happens there.