Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-05-04 19:34:40


On May 4, 2010, at 4:51 PM, Gus Correa wrote:

> Hi Ralph
>
> Ralph Castain wrote:
>> One possibility is that the sm btl might not like that you have hyperthreading enabled.
>
> I remember that hyperthreading was discussed months ago,
> in the previous incarnation of this problem/thread/discussion on "Nehalem vs. Open MPI".
> (It sounds like one of those supreme court cases ... )
>
> I don't really administer that machine,
> or any machine with hyperthreading,
> so I am not very familiar with the HT nitty-gritty.
> How do I turn off hyperthreading?
> Is it a BIOS or a Linux thing?
> I may try that.

I believe it can be turned off via an admin-level command, but I'm not certain about it.
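
Before changing anything, it may help to confirm that HT really is active. The sketch below is one possible check, not an Open MPI tool: it assumes an x86 Linux box where /proc/cpuinfo reports "siblings" (logical CPUs per package) and "cpu cores" (physical cores per package). The function name ht_status is made up for illustration. On a two-socket quad-core machine showing 16 "processors", one would expect siblings to be 8 and cpu cores to be 4.

```shell
# Report whether hyperthreading looks enabled, based on cpuinfo-style text.
# Assumption: x86 Linux, where "siblings" counts logical CPUs per package
# and "cpu cores" counts physical cores per package.
# ht_status is a hypothetical helper name, not part of any package.
ht_status() {
    cpuinfo="${1:-/proc/cpuinfo}"
    awk -F: '
        /^siblings/  { gsub(/[ \t]/, "", $2); siblings = $2 }
        /^cpu cores/ { gsub(/[ \t]/, "", $2); cores = $2 }
        END {
            # If either field is missing, do not guess.
            if (siblings == "" || cores == "") { print "unknown"; exit }
            if (siblings + 0 > cores + 0) print "on"; else print "off"
        }
    ' "$cpuinfo"
}

# Usage: ht_status            (reads /proc/cpuinfo)
#        ht_status some_file  (reads a saved copy instead)
```

If this prints "on", more logical CPUs than physical cores are being reported per package, which is what one would expect with HT enabled.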

>
>> Another thing to check: do you have any paffinity settings turned on
>> (e.g., mpi_paffinity_alone)?
>
> I didn't turn on or off any paffinity setting explicitly,
> either in the command line or in the mca config file.
> All that I did on the tests was to turn off "sm",
> or just use the default settings.
> I wonder if paffinity is on by default, is it?
> Should I turn it off?

It is off by default - I mention it because sometimes people have it set in the default MCA param file and don't realize it is on. Sounds okay here, though.
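
One way to rule out a forgotten setting is to search the MCA parameter files directly. This is only a rough sketch: the function name find_paffinity_settings is made up for illustration, and the file locations in the usage comment are the usual defaults (a per-user file and one under the installation prefix), so adjust them if your install differs.

```shell
# Print any paffinity-related lines set in MCA parameter files.
# Pass file paths as arguments; unreadable or missing files are skipped.
# find_paffinity_settings is a hypothetical helper name.
find_paffinity_settings() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        # Skip comment lines, then print "file: line" for each match.
        awk -v f="$f" '!/^[ \t]*#/ && tolower($0) ~ /paffinity/ { print f ": " $0 }' "$f"
    done
    return 0
}

# Usage (replace <prefix> with your Open MPI installation prefix):
#   find_paffinity_settings "$HOME/.openmpi/mca-params.conf" \
#                           "<prefix>/etc/openmpi-mca-params.conf"
```

No output means none of the files given sets a paffinity parameter.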

>
>> Our paffinity system doesn't handle hyperthreading at this time.
>
> OK, so *if* paffinity is on by default (Is it?),
> and hyperthreading is also on, as it is now,
> I must turn off one of them, maybe both, right?
> I may go combinatorial about this tomorrow.
> Can't do it today.
> Darn locked office door!

I would say don't worry about the paffinity right now - it sounds like it is off. You can always check, though, by running "ompi_info --param opal all" and looking at the setting of the opal_paffinity_alone variable.
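
Since that command lists a lot of parameters, a small filter can narrow it down. This is just a sketch: show_param is a made-up helper name, and it only greps whatever text it is given, so it makes no assumption about how ompi_info lays out its output.

```shell
# Keep only lines that mention the given parameter name
# (default: paffinity_alone). Reads stdin, so it does not depend
# on the exact ompi_info output layout.
# show_param is a hypothetical helper name.
show_param() {
    grep -- "${1:-paffinity_alone}" || echo "no matching parameter lines found"
}

# Usage (needs ompi_info from an Open MPI install on PATH):
#   ompi_info --param opal all | show_param
#   ompi_info --param opal all | show_param opal_paffinity_alone
```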

>
>> I'm just suspicious of the HT since you have a quad-core machine,
>> and the limit where things work seems to be 4...
>
> It may be.
> If you tell me how to turn off HT (I'll google around for it meanwhile),
> I will do it tomorrow, if I get a chance to
> hard reboot that pesky machine now locked behind a door.

Yeah, I'm beginning to believe it is the HT that is causing the problem...

>
> Thanks again for your help.
>
> Gus
>
>> On May 4, 2010, at 3:44 PM, Gus Correa wrote:
>>> Hi Jeff
>>>
>>> Sure, I will certainly try v1.4.2.
>>> I am downloading it right now.
>>> As of this morning, when I first downloaded,
>>> the web site still had 1.4.1.
>>> Maybe I should have refreshed the web page on my browser.
>>>
>>> I will tell you how it goes.
>>>
>>> Gus
>>>
>>> Jeff Squyres wrote:
>>>> Gus -- Can you try v1.4.2 which was just released today?
>>>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>>>> Hi Ralph
>>>>>
>>>>> Thank you very much.
>>>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>>>> at least for the little hello_c.c test.
>>>>> I just ran it fine up to 128 processes.
>>>>>
>>>>> I confess I am puzzled by this workaround.
>>>>> * Why should we turn off "sm" in a standalone machine,
>>>>> where everything is supposed to operate via shared memory?
>>>>> * Do I incur a performance penalty by not using "sm"?
>>>>> * What other mechanism is actually used by OpenMPI for process
>>>>> communication in this case?
>>>>>
>>>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>>>>
>>>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>>>> of network connections a process can open was reached in file
>>>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>>>> --------------------------------------------------------------------------
>>>>> Error: system limit exceeded on number of network connections that can
>>>>> be open
>>>>> This can be resolved by setting the mca parameter
>>>>> opal_set_max_sys_limits to 1,
>>>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>>>> or asking the system administrator to increase the system limit.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>>>> processors on real jobs anyway (and the very error message suggests a
>>>>> workaround to increase np, if needed).
>>>>>
>>>>> Many thanks,
>>>>> Gus Correa
>>>>>
>>>>> Ralph Castain wrote:
>>>>>> I would certainly try -mca btl ^sm and see if that solves the problem.
>>>>>>
>>>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>>>
>>>>>>> Gus Correa wrote:
>>>>>>>
>>>>>>>> Dear Open MPI experts
>>>>>>>>
>>>>>>>> I need your help to get Open MPI right on a standalone
>>>>>>>> machine with Nehalem processors.
>>>>>>>>
>>>>>>>> How to tweak the mca parameters to avoid problems
>>>>>>>> with Nehalem (and perhaps AMD processors also),
>>>>>>>> where MPI programs hang, was discussed here before.
>>>>>>>>
>>>>>>>> However, I lost track of the details: how to work around the problem,
>>>>>>>> and whether it has already been fully fixed.
>>>>>>> Yes, perhaps the problem you're seeing is not what you remember being discussed.
>>>>>>>
>>>>>>> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably fixed.
>>>>>>>
>>>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>>>
>>>>>>>> I installed OpenMPI 1.4.1 from source,
>>>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>>>> Then I tried to run it with:
>>>>>>>>
>>>>>>>> 1) mpirun -np 4 a.out
>>>>>>>> It ran OK (but seemed to be slow).
>>>>>>>>
>>>>>>>> 2) mpirun -np 16 a.out
>>>>>>>> It hung, and brought the machine to a halt.
>>>>>>>>
>>>>>>>> Any words of wisdom are appreciated.
>>>>>>>>
>>>>>>>> More info:
>>>>>>>>
>>>>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>>>> * OS is Fedora Core 12.
>>>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>>>> (I can see 16 "processors".)
>>>>>>>>
>>>>>>>> **
>>>>>>>>
>>>>>>>> What should I do?
>>>>>>>>
>>>>>>>> Use -mca btl ^sm ?
>>>>>>>> Use -mca btl_sm_num_fifos some_number ? (Which number?)
>>>>>>>> Use Both?
>>>>>>>> Do something else?
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users