
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Highly variable performance
From: Jed Brown (jed_at_[hidden])
Date: 2010-07-15 15:14:37

On Thu, 15 Jul 2010 13:03:31 -0400, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> Given the oversubscription on the existing HT links, could contention
> account for the difference? (I have no idea how HT's contention
> management works) Meaning: if the stars line up in a given run, you
> could end up with very little/no contention and you get good
> bandwidth. But if there's a bit of jitter, you could end up with
> quite a bit of contention that ends up cascading into a bunch of
> additional delay.

What contention? Many sockets needing to access memory on another
socket via HT links? Then yes, perhaps that could be a lot. As shown in
the diagram, it's pretty non-uniform, and if, say, sockets 0, 1, and 3
all found memory on socket 0 (say socket 2 had local memory), then there
are two ways for messages to get from 3 to 0 (via 1 or via 2). I don't
know if there is hardware support to re-route to avoid contention, but
if not, then socket 3 could be sharing the 1->0 HT link (which has a max
throughput of 8 GB/s, therefore 4 GB/s would be available per socket,
provided it was still operating at peak). Note that this 4 GB/s is
still more than splitting the 10.7 GB/s memory bandwidth three ways
(about 3.6 GB/s each), so the shared link would not be the bottleneck.
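A quick sketch of the bandwidth arithmetic, using the 8 GB/s HT link and 10.7 GB/s local memory figures from the thread:

```python
# Back-of-the-envelope check of the contention figures discussed above.
ht_link_bw = 8.0   # GB/s, max throughput of one HT link (from the thread)
mem_bw = 10.7      # GB/s, local memory bandwidth of socket 0 (from the thread)

# Sockets 1 and 3 share the 1->0 link, so each gets half of it.
per_socket_link = ht_link_bw / 2

# Sockets 0, 1, and 3 all read from socket 0's memory controller,
# so the memory bandwidth is split three ways.
per_socket_mem = mem_bw / 3

print(f"per-socket HT link share: {per_socket_link:.2f} GB/s")  # 4.00
print(f"per-socket memory share:  {per_socket_mem:.2f} GB/s")   # 3.57

# The per-socket link share exceeds the per-socket memory share, so link
# contention alone would not reduce bandwidth below the three-way split.
```

So under these (peak-throughput) assumptions the three-way memory split, not the shared HT link, is the limiting factor.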

> I fail to see how that could add up to 70-80 (or more) seconds of
> difference -- 13 secs vs. 90+ seconds (and more), though... 70-80
> seconds sounds like an IO delay -- perhaps paging due to the ramdisk
> or somesuch...? That's a SWAG.

This problem should have had significantly less resident memory than
would cause paging, but these were very short jobs, so even a small
amount of paging would cause a big performance hit. We have also seen
up to a factor of 10 variability in longer jobs (e.g. 1 hour for a
"fast" run) with larger working sets. But once the pages are faulted,
this kernel (2.6.18 from RHEL5) won't migrate them, so even if you
eventually swap out all of the ramdisk, pages faulted before and after
will be mapped to all sorts of inconvenient places.
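Where the pages of a running process actually landed can be inspected after the fact via /proc/&lt;pid&gt;/numa_maps, which reports per-node resident page counts as N&lt;node&gt;=&lt;pages&gt; fields on each mapping line. A minimal parsing sketch (the sample line below is made up for illustration, not output from this cluster):

```python
import re

def pages_per_node(numa_maps_line):
    """Extract {node: page_count} from one /proc/<pid>/numa_maps line.

    Lines look roughly like:
      7f5e92e00000 default anon=2048 dirty=2048 N0=1536 N1=512
    where N<k>=<n> gives the number of pages resident on NUMA node k.
    """
    return {int(node): int(count)
            for node, count in re.findall(r'\bN(\d+)=(\d+)', numa_maps_line)}

# Hypothetical sample line for illustration:
sample = "7f5e92e00000 default anon=2048 dirty=2048 N0=1536 N1=512"
print(pages_per_node(sample))  # {0: 1536, 1: 512}
```

Summing these counts across mappings shows how much of a job's working set ended up on remote nodes, which is the "inconvenient places" effect described above.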

But, I don't have any systematic testing with a guaranteed clean
ramdisk, and I'm not going to overanalyze the extra factors when there's
an understood factor of 3 hanging in the way. I'll give an update if
there is any news.