Per my other disclaimer, I'm trawling through my disastrous inbox and finding some orphaned / never-answered emails. Sorry for the delay!
On Jun 2, 2010, at 4:36 PM, Jed Brown wrote:
> The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected
> with QDR InfiniBand. The benchmark loops over
> with nlocal=10000 (80 KiB messages) 10000 times, so it normally runs in
> a few seconds.
Just to be clear -- you're running 8 procs locally on an 8-core node, right? (Cisco is an Intel partner -- I don't follow the AMD line much.) If so, this should all be local communication with no external network involved, right?
> # JOB TIME (s) HOST
> lsf.o240562 killed 8*a6200
> lsf.o240563 9.2110e+01 8*a6200
> lsf.o240564 1.5638e+01 8*a6237
> lsf.o240565 1.3873e+01 8*a6228
Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!
> ompirun -mca btl self,sm
> lsf.o240574 1.6916e+01 8*a6237
> lsf.o240575 1.7456e+01 8*a6200
> lsf.o240576 1.4183e+01 8*a6161
> lsf.o240577 1.3254e+01 8*a6203
> lsf.o240578 1.8848e+01 8*a6274
13 vs. 18 seconds. Better, but still dodgy.
> prun (quadrics)
> lsf.o240602 1.6168e+01 4*a2108+4*a2109
> lsf.o240603 1.6746e+01 4*a2110+4*a2111
> lsf.o240604 1.6371e+01 4*a2108+4*a2109
> lsf.o240606 1.6867e+01 4*a2110+4*a2111
Nice and consistent, as you mentioned. And I assume your notation here means that it's across 2 nodes.
> ompirun -mca btl self,openib
> lsf.o240776 3.1463e+01 8*a6203
> lsf.o240777 3.0418e+01 8*a6264
> lsf.o240778 3.1394e+01 8*a6203
> lsf.o240779 3.5111e+01 8*a6274
Much more consistent, though slower than sm overall. Probably because all messages are equally penalized by going out to the HCA and back.
> ompirun -mca btl self,sm,openib
> lsf.o240851 1.3848e+01 8*a6244
> lsf.o240852 1.7362e+01 8*a6237
> lsf.o240854 1.3266e+01 8*a6204
> lsf.o240855 1.3423e+01 8*a6276
This should be pretty much the same as sm,self, because openib shouldn't be used for any of the communication (i.e., Open MPI should determine that sm is the "best" transport between all the peers and silently discard openib).
> lsf.o240858 1.4415e+01 8*a6244
> lsf.o240859 1.5092e+01 8*a6237
> lsf.o240860 1.3940e+01 8*a6204
> lsf.o240861 1.5521e+01 8*a6276
> lsf.o240903 1.3273e+01 8*a6234
> lsf.o240904 1.6700e+01 8*a6206
> lsf.o240905 1.4636e+01 8*a6269
> lsf.o240906 1.5056e+01 8*a6234
Strange that this would be different from the first set. It should be functionally equivalent to -mca btl self,sm,openib.
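If it helps to track down the difference, Open MPI can be asked to log which BTLs it actually selects for each peer pair (btl_base_verbose is the standard MCA verbosity knob; ./benchmark stands in for your binary here):

```shell
# List the BTL components compiled into this Open MPI install
ompi_info | grep btl

# Re-run with BTL selection logging to confirm sm (not openib) is
# chosen for the on-node peers
ompirun -mca btl self,sm,openib -mca btl_base_verbose 30 ./benchmark
```
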
> ompirun -mca btl self,tcp
> lsf.o240948 1.8504e+01 8*a6234
> lsf.o240949 1.9317e+01 8*a6207
> lsf.o240950 1.8964e+01 8*a6234
> lsf.o240951 2.0764e+01 8*a6207
Variation here isn't too bad. The slowdown (compared to sm) is likely because messages go through the TCP loopback stack rather than "directly" to the peer through shared memory.
...a quick look through the rest seems to indicate that they're more-or-less consistent with what you showed above.
Your later mail says:
> Following up on this, I have partial resolution. The primary culprit
> appears to be stale files in a ramdisk non-uniformly distributed across
> the sockets, thus interacting poorly with NUMA. The slow runs
> invariably have high numa_miss and numa_foreign counts. I still have
> trouble making it explain up to a factor of 10 degradation, but it
> certainly explains a factor of 3.
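For what it's worth, a quick way to watch those counters around a run (numastat ships with the numactl package; counter names as on 2.6-era kernels):

```shell
# Snapshot NUMA allocation counters before and after a run;
# growth in numa_miss / numa_foreign means off-node allocations
numastat > before.txt
ompirun -mca btl self,sm ./benchmark
numastat > after.txt
diff before.txt after.txt

# Per-node free memory -- stale ramdisk pages would show up as
# lopsided free space across the four sockets
numactl --hardware
```
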
Try playing with Open MPI's process affinity options, such as --bind-to-core (see mpirun(1)). This may prevent some OS jitter from processes being migrated between sockets, and allows each process's memory to stay pinned local to its NUMA node.
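For example (option names from the mpirun(1) of that era; ./benchmark stands in for your binary):

```shell
# Bind each rank to one core and print the resulting bindings to stderr
ompirun -np 8 --bind-to-core --report-bindings ./benchmark

# Or bind ranks to sockets instead, if you want to leave the scheduler
# some freedom within a socket
ompirun -np 8 --bind-to-socket --report-bindings ./benchmark
```
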