Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-04 23:38:21

Hi Doug

Thank you for your input.
I fully agree with you.
I do not expect to get much from hyperthreading in terms of performance.
However, at this point I am just interested in having Open MPI working
right with *both* HT on and HT off.
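
A side note on checking the HT state: on Linux, /proc/cpuinfo lists one
"processor" entry per logical CPU, and entries that share the same
"physical id" and "core id" are hardware threads of one core.
A minimal Python sketch of that comparison follows. The function name
count_cpus and the sample text are made up for illustration, and the
field names are the ones x86 Linux kernels normally print; other
architectures may differ.

```python
# Hypothetical helper: count logical CPUs and physical cores from
# /proc/cpuinfo-style text (x86 Linux field names assumed).

def count_cpus(cpuinfo_text):
    logical = 0
    cores = set()          # distinct (physical id, core id) pairs
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue       # blank separator lines between entries
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1
            physical_id = None
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return logical, len(cores)


# Made-up sample: one socket, two cores, two hardware threads per core.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

logical, physical = count_cpus(SAMPLE)
print(logical, physical, "HT on" if logical > physical else "HT off or absent")
```

On a real machine the same function could be fed
open("/proc/cpuinfo").read(). A two-socket quad-core box with HT on
should then report 16 logical CPUs and 8 physical cores.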

Anyway, back to your comment about the usefulness of HT.
This is all hearsay, the hand-waving argument I heard about Intel
hyperthreading (HT), its IBM cousin "simultaneous multi-threading"
(SMT), and most likely some other equivalents out there.
You kind of suggested some of these points in your message.
In any case, please don't quote me on that,
although just posting this on the list already puts me on the spot.

An expert could jump in and correct me, please.

1) HT/SMT works well for codes that have many
branches/decisions (like if/else), because the next instructions to
be fetched/executed are not predictable. Having two threads
active on a single core can harness the core/CPU cycles that sit idle
while the "other" thread is fetching a new, non-predictable
instruction after the frequent branches/decisions.

2) Predictable instructions, on the other hand, can be pipelined
for execution, and do not leave many idle CPU cycles.

Most of our scientific codes (finite element, finite differences,
finite volume, spectral, linear algebra solvers) are NOT characterized
by branches, but by big repetitive inner loops that do not leave
many idle CPU cycles.
(Well, at least when they are thoughtfully written.)

I.e., they are mostly made of predictable instructions that fit
nicely in the CPU pipeline.
Hence, the active thread becomes greedy,
and doesn't give the "other" thread much of a chance
to get hold of the CPU.
Hence, hyperthreading on these codes is not helpful.

That is the (common) wisdom about HT/SMT I was told.
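
To put rough numbers on this argument, here is a toy model in Python.
It is only a sketch under strong assumptions that are mine, not
anything measured: a thread running alone keeps the core busy for a
fraction "busy" of the cycles and stalls for the rest, and a second
thread on the same core can use only those stalled cycles.
The function name smt_gain is made up for the illustration.

```python
# Toy model (assumption, not a measurement): one thread alone uses a
# fraction `busy` of the core's cycles; a second identical thread can
# only fill the cycles the first one leaves idle.

def smt_gain(busy):
    """Idealized throughput ratio of two threads per core vs. one."""
    if not 0.0 < busy <= 1.0:
        raise ValueError("busy must be in (0, 1]")
    # Two threads together cannot use more than 100% of the core.
    combined = min(1.0, 2.0 * busy)
    return combined / busy


for busy in (0.5, 0.8, 0.9, 1.0):
    print(f"busy={busy:.2f}  gain={smt_gain(busy):.2f}x")
```

In this model a branchy code that stalls half of the time could double
its throughput, while a tight loop that keeps the core 100% busy gains
nothing. The model ignores cache and memory-bandwidth contention and
the cost of running more MPI processes, so it can never predict an
actual slowdown with HT on.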

3) However, I saw one person reporting modest gains in speedup
(10-20%) when running an ocean model (finite differences, domain
decomposition, OpenMPI, actually a Mac OS-X cluster).
It may have been here on this list, IIRC.

I myself saw speedups in this range, maybe up to 30%,
not on Linux, but on a big IBM Power6 machine
(32 CPUs/node, which look like 64 CPUs/node with SMT turned on).
On these IBM machines SMT is turned on/off by the user,
via environment variables, which is very convenient.
This was when I ran a coupled
climate model (5 executables in MPMD mode using the IBM MPI).

4) I am not so surprised by the numbers you reported.
Based on the common wisdom above, the more optimized
the loops in your code are, the less useful HT becomes.
You may need to screw up the code a bit, say, by inserting
branches/decisions in your inner loops, for HT to be of help.
However, the net result of doing that may actually be a loss
w.r.t. just running the optimized code without HT, I would guess.
There is nothing like a clean and clever algorithm.

Gus Correa
(still struggling to get Open MPI to get along with HT,
but now self-promoted to parallel programming theoretician :) )
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA

Doug Reeder wrote:
> Hello,
> I have a mac with two quad core nehalem chips (8 cores). The sysctl
> command shows 16 cpus (apparently w/ hyperthreading). I have a finite
> element code that runs in parallel using openmpi. Running on the single
> machine using openmpi -np 8 takes about 2/3 of the time that running with -np
> 16 does. The program is very well optimized for parallel processing so I
> strongly suspect that hyperthreading is not helping. The program fairly
> aggressively uses 100% of each cpu it is on so I don't think
> hyperthreading gets much of a chance to split the cpu activity. I would
> certainly welcome input/insight from an intel hardware engineer. I make
> sure that I don't ask for more processors than there are physical cores
> and that seems to work.
> Doug Reeder
> On May 4, 2010, at 7:06 PM, Gus Correa wrote:
>> Hi Ralph
>> Thank you so much for your help.
>> You are right, paffinity is turned off (default):
>> **************
>> /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/ompi_info --param opal all |
>> grep paffinity
>> MCA opal: parameter "opal_paffinity_alone" (current
>> value: "0", data source: default value, synonyms: mpi_paffinity_alone,
>> mpi_paffinity_alone)
>> **************
>> I will try your suggestion to turn off HT tomorrow,
>> and report back here.
>> Douglas Guptill kindly sent a recipe to turn HT off via BIOS settings.
>> Cheers,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>> Ralph Castain wrote:
>>> On May 4, 2010, at 4:51 PM, Gus Correa wrote:
>>>> Hi Ralph
>>>> Ralph Castain wrote:
>>>>> One possibility is that the sm btl might not like that you have
>>>>> hyperthreading enabled.
>>>> I remember that hyperthreading was discussed months ago,
>>>> in the previous incarnation of this problem/thread/discussion on
>>>> "Nehalem vs. Open MPI".
>>>> (It sounds like one of those supreme court cases ... )
>>>> I don't really administer that machine,
>>>> or any machine with hyperthreading,
>>>> so I am not very familiar with the HT nitty-gritty.
>>>> How do I turn off hyperthreading?
>>>> Is it a BIOS or a Linux thing?
>>>> I may try that.
>>> I believe it can be turned off via an admin-level cmd, but I'm not
>>> certain about it
>>>>> Another thing to check: do you have any paffinity settings turned on
>>>> (e.g., mpi_paffinity_alone)?
>>>> I didn't turn on or off any paffinity setting explicitly,
>>>> either in the command line or in the mca config file.
>>>> All that I did on the tests was to turn off "sm",
>>>> or just use the default settings.
>>>> I wonder if paffinity is on by default, is it?
>>>> Should I turn it off?
>>> It is off by default - I mention it because sometimes people have it
>>> set in the default MCA param file and don't realize it is on. Sounds
>>> okay here, though.
>>>>> Our paffinity system doesn't handle hyperthreading at this time.
>>>> OK, so *if* paffinity is on by default (Is it?),
>>>> and hyperthreading is also on, as it is now,
>>>> I must turn off one of them, maybe both, right?
>>>> I may go combinatorial about this tomorrow.
>>>> Can't do it today.
>>>> Darn locked office door!
>>> I would say don't worry about the paffinity right now - sounds like
>>> it is off. You can always check, though, by running "ompi_info
>>> --param opal all" and checking for the setting of the
>>> opal_paffinity_alone variable
>>>>> I'm just suspicious of the HT since you have a quad-core machine,
>>>> and the limit where things work seems to be 4...
>>>> It may be.
>>>> If you tell me how to turn off HT (I'll google around for it
>>>> meanwhile),
>>>> I will do it tomorrow, if I get a chance to
>>>> hard reboot that pesky machine now locked behind a door.
>>> Yeah, I'm beginning to believe it is the HT that is causing the
>>> problem...
>>>> Thanks again for your help.
>>>> Gus
>>>>> On May 4, 2010, at 3:44 PM, Gus Correa wrote:
>>>>>> Hi Jeff
>>>>>> Sure, I will certainly try v1.4.2.
>>>>>> I am downloading it right now.
>>>>>> As of this morning, when I first downloaded,
>>>>>> the web site still had 1.4.1.
>>>>>> Maybe I should have refreshed the web page on my browser.
>>>>>> I will tell you how it goes.
>>>>>> Gus
>>>>>> Jeff Squyres wrote:
>>>>>>> Gus -- Can you try v1.4.2 which was just released today?
>>>>>>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>>>>>>> Hi Ralph
>>>>>>>> Thank you very much.
>>>>>>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>>>>>>> at least for the little hello_c.c test.
>>>>>>>> I just ran it fine up to 128 processes.
>>>>>>>> I confess I am puzzled by this workaround.
>>>>>>>> * Why should we turn off "sm" in a standalone machine,
>>>>>>>> where everything is supposed to operate via shared memory?
>>>>>>>> * Do I incur a performance penalty by not using "sm"?
>>>>>>>> * What other mechanism is actually used by OpenMPI for process
>>>>>>>> communication in this case?
>>>>>>>> It seems to be using tcp, because when I try -np 256 I get this
>>>>>>>> error:
>>>>>>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit
>>>>>>>> on number
>>>>>>>> of network connections a process can open was reached in file
>>>>>>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Error: system limit exceeded on number of network connections
>>>>>>>> that can
>>>>>>>> be open
>>>>>>>> This can be resolved by setting the mca parameter
>>>>>>>> opal_set_max_sys_limits to 1,
>>>>>>>> increasing your limit descriptor setting (using limit or ulimit
>>>>>>>> commands),
>>>>>>>> or asking the system administrator to increase the system limit.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>>>>>>> processors on real jobs anyway (and the very error message
>>>>>>>> suggests a
>>>>>>>> workaround to increase np, if needed).
>>>>>>>> Many thanks,
>>>>>>>> Gus Correa
>>>>>>>> Ralph Castain wrote:
>>>>>>>>> I would certainly try it -mca btl ^sm and see if that solves
>>>>>>>>> the problem.
>>>>>>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>>>>>>> Gus Correa wrote:
>>>>>>>>>>> Dear Open MPI experts
>>>>>>>>>>> I need your help to get Open MPI right on a standalone
>>>>>>>>>>> machine with Nehalem processors.
>>>>>>>>>>> How to tweak the mca parameters to avoid problems
>>>>>>>>>>> with Nehalem (and perhaps AMD processors also),
>>>>>>>>>>> where MPI programs hang, was discussed here before.
>>>>>>>>>>> However, I lost track of the details, how to work around the
>>>>>>>>>>> problem,
>>>>>>>>>>> and if it was fully fixed already perhaps.
>>>>>>>>>> Yes, perhaps the problem you're seeing is not what you
>>>>>>>>>> remember being discussed.
>>>>>>>>>> Perhaps you're thinking of
>>>>>>>>>> . It's
>>>>>>>>>> presumably fixed.
>>>>>>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>>>>>> I installed OpenMPI 1.4.1 from source,
>>>>>>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>>>>>>> Then I tried to run it with:
>>>>>>>>>>> 1) mpirun -np 4 a.out
>>>>>>>>>>> It ran OK (but seemed to be slow).
>>>>>>>>>>> 2) mpirun -np 16 a.out
>>>>>>>>>>> It hung, and brought the machine to a halt.
>>>>>>>>>>> Any words of wisdom are appreciated.
>>>>>>>>>>> More info:
>>>>>>>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>>>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>>>>>>> * OS is Fedora Core 12.
>>>>>>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>>>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>>>>>>> (I can see 16 "processors".)
>>>>>>>>>>> **
>>>>>>>>>>> What should I do?
>>>>>>>>>>> Use -mca btl ^sm ?
>>>>>>>>>>> Use -mca btl -mca btl_sm_num_fifos=some_number ? (Which number?)
>>>>>>>>>>> Use Both?
>>>>>>>>>>> Do something else?
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users_at_[hidden]