
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)
From: Gus Correa (gus_at_[hidden])
Date: 2009-12-10 17:53:03

Hi Jeff

Thanks for jumping in! :)
And for your clarifications too, of course.

How does the efficiency of loopback
(let's say, over TCP and over IB) compare with "sm"?

FYI, I do NOT see the problem reported by Matthew et al.
on our AMD Opteron Shanghai dual-socket quad-core.
They run a quite outdated CentOS kernel (2.6.18-92.1.22.el5),
with gcc 4.1.2 and Open MPI 1.3.2.
(I've been too lazy to upgrade; it is a production machine.)

I could run all three OpenMPI test programs (hello_c, ring_c, and
connectivity_c) on all 8 cores on a single node WITH "sm" turned ON
with no problem whatsoever.
(I also had IB turned on, but I can run again
with sm only if you think this can make a difference.)
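For reference, restricting a run to shared memory (or excluding it) can be done on the mpirun command line with the usual MCA btl selection syntax; this is just a sketch, assuming the test binaries built from the Open MPI examples/ directory:

```shell
# Run the connectivity test on 8 cores using ONLY the shared-memory
# BTL ("self" is needed so a process can send to itself):
mpirun --mca btl sm,self -np 8 ./connectivity_c

# Conversely, exclude sm (the ^ negates the list), so node-local
# traffic falls back to another transport such as TCP loopback:
mpirun --mca btl ^sm -np 8 ./connectivity_c
```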

Moreover, all works fine if I oversubscribe with up to 256 processes
on one node.
Beyond that I sometimes (but not always) get a segmentation
fault rather than a hang.
I understand that extreme oversubscription is a no-no.
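For the record, an oversubscribed run on a single node looks like this (a sketch only; 1.3-era Open MPI oversubscribes by default when the hostfile advertises enough slots, while later releases want an explicit flag):

```shell
# Pack 256 ranks onto one 8-core node; --oversubscribe suppresses the
# slot-count check in newer Open MPI releases:
mpirun --oversubscribe -np 256 ./ring_c
```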

Moreover, in the screenshots that Matthew posted, the cores
were at 100% CPU utilization on the simple connectivity_c test
(although that was when he had "sm" turned on, on Nehalem).
On my platform I don't see anything above 3% or so.

Matthew: Which levels of CPU utilization do you see now?

My two speculative cents.
Gus Correa
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA

Jeff Squyres wrote:
> On Dec 10, 2009, at 5:01 PM, Gus Correa wrote:
>> A couple of questions to the OpenMPI pros:
>> If shared memory ("sm") is turned off on a standalone computer,
>> which mechanism is used for MPI communication?
>> TCP via loopback port? Other?
> Whatever device supports node-local loopback. TCP is one; some OpenFabrics devices do, too.
>> Why wouldn't shared memory work right on Nehalem?
>> (That is probably distressing for Mark, Matthew, and other Nehalem owners.)
> To be clear, we don't know that this is a Nehalem-specific problem. We actually thought it was an AMD-specific problem, but these results are interesting. We've had a notoriously difficult time reproducing the problem reliably, which is why it hasn't been fixed yet. :-(
> The best luck so far in reproducing the problem has been with GCC 4.4.x (at Sun). I've been trying for a few days to install GCC 4.4 on my machines without much luck yet. Still working on it...