On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:
> How does the efficiency of loopback
> (let's say, over TCP and over IB) compare with "sm"?
Definitely not as good; that's why we have sm. :-) I don't have numbers to back that assertion up, though -- no quantified comparison handy.
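If anyone wants to measure the difference on their own machine, the on-node transport can be switched between shared memory and TCP loopback via the btl MCA parameter. A sketch -- `./osu_latency` here stands in for whatever point-to-point latency benchmark you have handy:

```shell
# On-node latency over shared memory (the default on-node path)
mpirun --mca btl sm,self -np 2 ./osu_latency

# Same pair of processes, but forced over TCP loopback by excluding sm
mpirun --mca btl tcp,self -np 2 ./osu_latency
```

The `self` component must stay in the list in both cases; it handles send-to-self.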
> FYI, I do NOT see the problem reported by Matthew et al.
> on our AMD Opteron Shanghai dual-socket quad-core.
> They run a quite outdated
> CentOS kernel 2.6.18-92.1.22.el5, with gcc 4.1.2
> and Open MPI 1.3.2.
> (I've been lazy to upgrade, it is a production machine.)
> I could run all three OpenMPI test programs (hello_c, ring_c, and
> connectivity_c) on all 8 cores on a single node WITH "sm" turned ON
> with no problem whatsoever.
> Moreover, all works fine if I oversubscribe up to 256 processes on
> one node.
> Beyond that I get segmentation fault (not hanging) sometimes,
> but not always.
> I understand that extreme oversubscription is a no-no.
It's quite possible that extreme oversubscription and/or that many procs in sm have not been well-tested.
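As an aside, if you do need to oversubscribe heavily, telling Open MPI to yield the processor when idle (rather than aggressively polling) usually behaves better. A sketch, using the ring_c test program mentioned above:

```shell
# Oversubscribed run: yield the CPU while waiting instead of spinning
mpirun --mca mpi_yield_when_idle 1 -np 256 ./ring_c
```

Open MPI also sets this automatically when it knows it is oversubscribed (e.g., when the hostfile's slot counts are exceeded), but passing it explicitly doesn't hurt.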
> Moreover, on the screenshots that Matthew posted, the cores
> were at 100% CPU utilization on the simple connectivity_c
> (although this was when he had "sm" turned on on Nehalem).
> On my platform I don't get anything more than 3% or so.
100% CPU utilization usually means that an expected completion hasn't occurred, so everything is spinning while waiting for it. The "hasn't occurred" part is probably the bug here -- most likely a completion that should have been generated got missed somehow. But this is speculative -- we're still investigating...
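To illustrate why a missed completion shows up as 100% CPU rather than a hang at 0%: the progress engine busy-polls for completions instead of blocking. A generic shell sketch (not Open MPI code; `/tmp/completion_flag` is an arbitrary stand-in for a completion event):

```shell
# Simulate a progress loop polling for a "completion".
rm -f /tmp/completion_flag

# The completion arrives late (after 2 seconds), from "somewhere else".
( sleep 2; touch /tmp/completion_flag ) &

# Busy-poll until the completion appears; this pegs one core at ~100% CPU
# the whole time, exactly like a process stuck waiting on a lost completion.
while [ ! -e /tmp/completion_flag ]; do :; done
echo "completion observed"
```

If the `touch` never happened (the analogue of a lost completion), the loop would spin forever at full CPU, which matches the screenshots Matthew posted.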