Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-04 18:51:49


Hi Ralph

Ralph Castain wrote:
> One possibility is that the sm btl might not like that
> you have hyperthreading enabled.

I remember that hyperthreading was discussed months ago,
in the previous incarnation of this problem/thread/discussion on
"Nehalem vs. Open MPI".
(It sounds like one of those Supreme Court cases ...)

I don't really administer that machine,
or any machine with hyperthreading,
so I am not very familiar with the HT nitty-gritty.
How do I turn off hyperthreading?
Is it a BIOS or a Linux thing?
I may try that.
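
(From my googling so far, it seems to be both: HT can be disabled
permanently in the BIOS setup, or the extra logical CPUs can be taken
offline at runtime through sysfs. A sketch of the runtime approach;
cpu8 below is only an example, the topology files tell you which
logical CPUs are really HT siblings:)

# show which logical CPUs share cpu0's physical core
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# take one HT sibling offline (as root; repeat for each sibling)
echo 0 > /sys/devices/system/cpu/cpu8/online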

>
> Another thing to check: do you have any paffinity settings turned on
> (e.g., mpi_paffinity_alone)?

I didn't explicitly turn any paffinity setting on or off,
either on the command line or in the MCA config file.
All I did in the tests was either turn off "sm"
or use the default settings.
I wonder: is paffinity on by default?
Should I turn it off?
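
(A way to check, I suppose: ompi_info should print the defaults,
if I have the 1.4.x syntax right:)

# list affinity-related MCA parameters and their current values
ompi_info --param mpi all | grep paffinity

# explicitly turn processor affinity off for a run, just in case
mpirun -mca mpi_paffinity_alone 0 -np 16 a.out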

> Our paffinity system doesn't handle hyperthreading at this time.
>

OK, so *if* paffinity is on by default (Is it?),
and hyperthreading is also on, as it is now,
I must turn off one of them, maybe both, right?
I may go combinatorial about this tomorrow.
Can't do it today.
Darn locked office door!

> I'm just suspicious of the HT since you have a quad-core machine,
> and the limit where things work seems to be 4...
>

It may be.
If you tell me how to turn off HT (I'll google around for it meanwhile),
I will do it tomorrow, if I get a chance to
hard reboot that pesky machine now locked behind a door.

Thanks again for your help.

Gus

> On May 4, 2010, at 3:44 PM, Gus Correa wrote:
>
>> Hi Jeff
>>
>> Sure, I will certainly try v1.4.2.
>> I am downloading it right now.
>> As of this morning, when I first downloaded,
>> the web site still had 1.4.1.
>> Maybe I should have refreshed the web page on my browser.
>>
>> I will tell you how it goes.
>>
>> Gus
>>
>> Jeff Squyres wrote:
>>> Gus -- Can you try v1.4.2 which was just released today?
>>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>>> Hi Ralph
>>>>
>>>> Thank you very much.
>>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>>> at least for the little hello_c.c test.
>>>> I just ran it fine up to 128 processes.
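>>>>
>>>> (For the archives, the exact command was along these lines;
>>>> a.out is the compiled hello_c test:)
>>>>
>>>> # disable the shared-memory BTL; on-node traffic then goes
>>>> # through another transport (tcp, presumably)
>>>> mpirun -mca btl ^sm -np 128 a.out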
>>>>
>>>> I confess I am puzzled by this workaround.
>>>> * Why should we turn off "sm" on a standalone machine,
>>>> where everything is supposed to operate via shared memory?
>>>> * Do I incur a performance penalty by not using "sm"?
>>>> * What other mechanism is actually used by OpenMPI for process
>>>> communication in this case?
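>>>>
>>>> (I suppose one can ask Open MPI directly. If I read the verbose
>>>> knobs right, something like this should report which BTLs get
>>>> selected; the verbosity level 30 is an arbitrary choice:)
>>>>
>>>> # print BTL selection details while the job runs
>>>> mpirun -mca btl ^sm -mca btl_base_verbose 30 -np 4 a.out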
>>>>
>>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>>>
>>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>>> of network connections a process can open was reached in file
>>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>>> --------------------------------------------------------------------------
>>>> Error: system limit exceeded on number of network connections that can
>>>> be open
>>>> This can be resolved by setting the mca parameter
>>>> opal_set_max_sys_limits to 1,
>>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>>> or asking the system administrator to increase the system limit.
>>>> --------------------------------------------------------------------------
>>>>
>>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>>> processors on real jobs (and the error message itself suggests a
>>>> workaround to increase np, if needed).
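>>>>
>>>> (Just to spell out what the error message suggests, I believe
>>>> something along these lines would do, should we ever need a
>>>> large np; 4096 is an arbitrary value:)
>>>>
>>>> # raise this shell's open-file-descriptor limit (bash syntax)
>>>> ulimit -n 4096
>>>> # and/or let Open MPI raise the system limits itself
>>>> mpirun -mca opal_set_max_sys_limits 1 -np 256 a.out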
>>>>
>>>> Many thanks,
>>>> Gus Correa
>>>>
>>>> Ralph Castain wrote:
>>>>> I would certainly try it -mca btl ^sm and see if that solves the problem.
>>>>>
>>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>>
>>>>>> Gus Correa wrote:
>>>>>>
>>>>>>> Dear Open MPI experts
>>>>>>>
>>>>>>> I need your help to get Open MPI right on a standalone
>>>>>>> machine with Nehalem processors.
>>>>>>>
>>>>>>> How to tweak the MCA parameters to work around the problem
>>>>>>> where MPI programs hang on Nehalem (and perhaps also AMD)
>>>>>>> processors was discussed here before.
>>>>>>>
>>>>>>> However, I lost track of the details: how to work around the
>>>>>>> problem, and whether it has been fully fixed already.
>>>>>> Yes, perhaps the problem you're seeing is not what you remember being discussed.
>>>>>>
>>>>>> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably fixed.
>>>>>>
>>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>>
>>>>>>> I installed OpenMPI 1.4.1 from source,
>>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>>> Then I tried to run it with:
>>>>>>>
>>>>>>> 1) mpirun -np 4 a.out
>>>>>>> It ran OK (but seemed to be slow).
>>>>>>>
>>>>>>> 2) mpirun -np 16 a.out
>>>>>>> It hung, and brought the machine to a halt.
>>>>>>>
>>>>>>> Any words of wisdom are appreciated.
>>>>>>>
>>>>>>> More info:
>>>>>>>
>>>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>>> * OS is Fedora Core 12.
>>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>>> (I can see 16 "processors".)
>>>>>>>
>>>>>>> **
>>>>>>>
>>>>>>> What should I do?
>>>>>>>
>>>>>>> Use -mca btl ^sm ?
>>>>>>> Use -mca btl_sm_num_fifos some_number ? (Which number?)
>>>>>>> Use both?
>>>>>>> Do something else? (Tentative sketches below.)
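>>>>>>>
>>>>>>> (Tentative command lines for the candidate workarounds; the
>>>>>>> FIFO count of 8 is only a guess, not a recommendation:)
>>>>>>>
>>>>>>> # option 1: disable the shared-memory BTL entirely
>>>>>>> mpirun -mca btl ^sm -np 16 a.out
>>>>>>>
>>>>>>> # option 2: keep sm, but give it more FIFOs
>>>>>>> mpirun -mca btl_sm_num_fifos 8 -np 16 a.out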