
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Highly variable performance
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-07-15 09:36:18


Per my other disclaimer, I'm trolling through my disastrous inbox and finding some orphaned / never-answered emails. Sorry for the delay!

On Jun 2, 2010, at 4:36 PM, Jed Brown wrote:

> The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected
> with QDR InfiniBand. The benchmark loops over
>
> MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);
>
> with nlocal=10000 (80 KiB messages) 10000 times, so it normally runs in
> a few seconds.

Just to be clear -- you're running 8 procs on a single node, right? (The 8380 is a quad-core part, so 4 sockets gives you 16 cores per node; Cisco is an Intel partner -- I don't follow the AMD line much.) So this should all be local communication with no external network involved, right?
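
For reference, here's a minimal sketch of how I'd reproduce the loop you describe -- the message size and iteration count come from your mail, but the program structure, buffer initialization, and timing are my assumptions:

    /* allgather_bench.c -- reproduction sketch (structure assumed) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nlocal = 10000;   /* 80 KiB of doubles per rank */
        const int iters  = 10000;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *localdata  = malloc(nlocal * sizeof(double));
        double *globaldata = malloc((size_t)nlocal * size * sizeof(double));
        for (int i = 0; i < nlocal; i++)
            localdata[i] = rank + i;        /* arbitrary fill */

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allgather(localdata, nlocal, MPI_DOUBLE,
                          globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("time: %.4e s\n", t1 - t0);

        free(localdata);
        free(globaldata);
        MPI_Finalize();
        return 0;
    }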

> # JOB TIME (s) HOST
>
> ompirun
> lsf.o240562 killed 8*a6200
> lsf.o240563 9.2110e+01 8*a6200
> lsf.o240564 1.5638e+01 8*a6237
> lsf.o240565 1.3873e+01 8*a6228

Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!

> ompirun -mca btl self,sm
> lsf.o240574 1.6916e+01 8*a6237
> lsf.o240575 1.7456e+01 8*a6200
> lsf.o240576 1.4183e+01 8*a6161
> lsf.o240577 1.3254e+01 8*a6203
> lsf.o240578 1.8848e+01 8*a6274

13 vs. 18 seconds. Better, but still dodgy.

> prun (quadrics)
> lsf.o240602 1.6168e+01 4*a2108+4*a2109
> lsf.o240603 1.6746e+01 4*a2110+4*a2111
> lsf.o240604 1.6371e+01 4*a2108+4*a2109
> lsf.o240606 1.6867e+01 4*a2110+4*a2111

Nice and consistent, as you mentioned. And I assume your notation here means that it's across 2 nodes.

> ompirun -mca btl self,openib
> lsf.o240776 3.1463e+01 8*a6203
> lsf.o240777 3.0418e+01 8*a6264
> lsf.o240778 3.1394e+01 8*a6203
> lsf.o240779 3.5111e+01 8*a6274

Much more consistent, though slower than sm -- probably because all messages are equally penalized by going out to the HCA and back.

> ompirun -mca self,sm,openib
> lsf.o240851 1.3848e+01 8*a6244
> lsf.o240852 1.7362e+01 8*a6237
> lsf.o240854 1.3266e+01 8*a6204
> lsf.o240855 1.3423e+01 8*a6276

This should be pretty much the same as sm,self, because openib shouldn't be used for any of the communication (i.e., Open MPI should determine that sm is the "best" transport between all the peers and silently discard openib).
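
(Side note: if you want to confirm which BTLs actually get used, cranking up the BTL verbosity should show which BTL components get opened and used at startup. Something like the following -- the binary name is just a placeholder for your benchmark, and you'd use your ompirun wrapper as appropriate:

    mpirun -np 8 -mca btl self,sm,openib -mca btl_base_verbose 50 ./allgather_bench

For an all-on-one-node run, sm should end up carrying the traffic between the local peers.)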

> ompirun
> lsf.o240858 1.4415e+01 8*a6244
> lsf.o240859 1.5092e+01 8*a6237
> lsf.o240860 1.3940e+01 8*a6204
> lsf.o240861 1.5521e+01 8*a6276
> lsf.o240903 1.3273e+01 8*a6234
> lsf.o240904 1.6700e+01 8*a6206
> lsf.o240905 1.4636e+01 8*a6269
> lsf.o240906 1.5056e+01 8*a6234

Strange that this would be different from the first plain ompirun run above. With no BTL restriction, it should be functionally equivalent to --mca btl self,sm,openib.

> ompirun -mca self,tcp
> lsf.o240948 1.8504e+01 8*a6234
> lsf.o240949 1.9317e+01 8*a6207
> lsf.o240950 1.8964e+01 8*a6234
> lsf.o240951 2.0764e+01 8*a6207

Variation here isn't too bad. The slowdown compared to sm is likely because messages go through the TCP loopback stack instead of "directly" to the peer via shared memory.

...a quick look through the rest seems to indicate that they're more-or-less consistent with what you showed above.

Your later mail says:

> Following up on this, I have partial resolution. The primary culprit
> appears to be stale files in a ramdisk non-uniformly distributed across
> the sockets, thus interacting poorly with NUMA. The slow runs
> invariably have high numa_miss and numa_foreign counts. I still have
> trouble making it explain up to a factor of 10 degradation, but it
> certainly explains a factor of 3.

Try playing with Open MPI's process affinity options, like --bind-to-core (see mpirun(1)). This can keep the OS from shuffling processes between cores (and the jitter that comes with it), and it keeps each process's memory allocations local to its own NUMA node.
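
For example, something along these lines (assuming your Open MPI is new enough to have these mpirun options, and using numastat -- the standard Linux tool for the numa_miss / numa_foreign counters you mentioned -- to check the effect; the binary name is again a placeholder):

    numastat                                    # snapshot the per-node counters
    mpirun -np 8 --bind-to-core --report-bindings ./allgather_bench
    numastat                                    # compare numa_miss / numa_foreign afterwards

--report-bindings prints where each process actually landed, which makes it easy to spot a bad placement.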

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/