Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-05-04 18:17:48


One possibility is that the sm btl might not like that you have hyperthreading enabled.

Another thing to check: do you have any paffinity settings turned on (e.g., mpi_paffinity_alone)? Our paffinity system doesn't handle hyperthreading at this time.

I'm just suspicious of the HT since you have a quad-core machine, and the limit where things work seems to be 4...
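
A quick way to confirm whether hyperthreading is actually active is to compare the number of logical processors with the number of distinct physical cores listed in /proc/cpuinfo. The following is only a minimal sketch, not part of Open MPI: it assumes an x86 Linux system whose /proc/cpuinfo entries carry the "physical id" and "core id" fields, and the function names are made up for illustration.

```python
import re


def summarize_cpuinfo(text):
    """Return (logical_processors, physical_cores) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    # Each logical processor is described by one blank-line-separated block.
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        if "physical id" in fields and "core id" in fields:
            # Hardware threads on the same core share this (socket, core) pair.
            cores.add((fields["physical id"], fields["core id"]))
        else:
            # Assumption: without topology fields, count the entry as its own core.
            cores.add(("logical", fields["processor"]))
    return logical, len(cores)


def hyperthreading_active(text):
    """True when there are more logical processors than physical cores."""
    logical, physical = summarize_cpuinfo(text)
    return logical > physical


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            logical, physical = summarize_cpuinfo(handle.read())
    except OSError:
        print("no readable /proc/cpuinfo on this system")
    else:
        print("logical processors:", logical)
        print("physical cores:", physical)
        print("hyperthreading appears active:", logical > physical)
```

On the machine described below (16 "processors" in /proc/cpuinfo, quad-core Xeon 5540s on a two-way board), one would expect this to report 16 logical processors and, if both sockets are populated, 8 physical cores.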

On May 4, 2010, at 3:44 PM, Gus Correa wrote:

> Hi Jeff
>
> Sure, I will certainly try v1.4.2.
> I am downloading it right now.
> As of this morning, when I first downloaded,
> the web site still had 1.4.1.
> Maybe I should have refreshed the web page on my browser.
>
> I will tell you how it goes.
>
> Gus
>
> Jeff Squyres wrote:
>> Gus -- Can you try v1.4.2 which was just released today?
>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>> Hi Ralph
>>>
>>> Thank you very much.
>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>> at least for the little hello_c.c test.
>>> I just ran it fine up to 128 processes.
>>>
>>> I confess I am puzzled by this workaround.
>>> * Why should we turn off "sm" in a standalone machine,
>>> where everything is supposed to operate via shared memory?
>>> * Do I incur a performance penalty by not using "sm"?
>>> * What other mechanism is actually used by OpenMPI for process
>>> communication in this case?
>>>
>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>>
>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>> of network connections a process can open was reached in file
>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>> --------------------------------------------------------------------------
>>> Error: system limit exceeded on number of network connections that can
>>> be open
>>> This can be resolved by setting the mca parameter
>>> opal_set_max_sys_limits to 1,
>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>> or asking the system administrator to increase the system limit.
>>> --------------------------------------------------------------------------
>>>
>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>> processors on real jobs (and the error message itself suggests a
>>> workaround to increase np, if needed).
>>>
>>> Many thanks,
>>> Gus Correa
>>>
>>> Ralph Castain wrote:
>>>> I would certainly try -mca btl ^sm and see if that solves the problem.
>>>>
>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>
>>>>> Gus Correa wrote:
>>>>>
>>>>>> Dear Open MPI experts
>>>>>>
>>>>>> I need your help to get Open MPI right on a standalone
>>>>>> machine with Nehalem processors.
>>>>>>
>>>>>> How to tweak the mca parameters to avoid problems
>>>>>> with Nehalem (and perhaps AMD processors also),
>>>>>> where MPI programs hang, was discussed here before.
>>>>>>
>>>>>> However, I lost track of the details: how to work around the problem,
>>>>>> and whether it has already been fully fixed.
>>>>> Yes, perhaps the problem you're seeing is not what you remember being discussed.
>>>>>
>>>>> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably fixed.
>>>>>
>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>
>>>>>> I installed OpenMPI 1.4.1 from source,
>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>> Then I tried to run it with:
>>>>>>
>>>>>> 1) mpirun -np 4 a.out
>>>>>> It ran OK (but seemed to be slow).
>>>>>>
>>>>>> 2) mpirun -np 16 a.out
>>>>>> It hung, and brought the machine to a halt.
>>>>>>
>>>>>> Any words of wisdom are appreciated.
>>>>>>
>>>>>> More info:
>>>>>>
>>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>> * OS is Fedora Core 12.
>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>> (I can see 16 "processors".)
>>>>>>
>>>>>> **
>>>>>>
>>>>>> What should I do?
>>>>>>
>>>>>> Use -mca btl ^sm ?
>>>>>> Use -mca btl_sm_num_fifos some_number ? (Which number?)
>>>>>> Use Both?
>>>>>> Do something else?
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users