Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-04 17:44:17


Hi Jeff

Sure, I will certainly try v1.4.2.
I am downloading it right now.
As of this morning, when I first downloaded,
the web site still had 1.4.1.
Maybe I should have refreshed the web page on my browser.

I will tell you how it goes.

Gus

Jeff Squyres wrote:
> Gus -- Can you try v1.4.2 which was just released today?
>
> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>
>> Hi Ralph
>>
>> Thank you very much.
>> The "-mca btl ^sm" workaround seems to have solved the problem,
>> at least for the little hello_c.c test.
>> I just ran it fine up to 128 processes.
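>>
>> Concretely, the workaround amounts to something like this
>> (a.out is just what mpicc produced from hello_c.c):
>>
>>    mpirun -mca btl ^sm -np 128 a.out
>>
>> where "^sm" tells Open MPI to exclude the shared memory BTL and fall
>> back to whatever other transport is available.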
>>
>> I confess I am puzzled by this workaround.
>> * Why should we turn off "sm" in a standalone machine,
>> where everything is supposed to operate via shared memory?
>> * Do I incur a performance penalty by not using "sm"?
>> * What other mechanism is actually used by OpenMPI for process
>> communication in this case?
>>
>> It seems to be using tcp, because when I try -np 256 I get this error:
>>
>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>> of network connections a process can open was reached in file
>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>> --------------------------------------------------------------------------
>> Error: system limit exceeded on number of network connections that can
>> be open
>> This can be resolved by setting the mca parameter
>> opal_set_max_sys_limits to 1,
>> increasing your limit descriptor setting (using limit or ulimit commands),
>> or asking the system administrator to increase the system limit.
>> --------------------------------------------------------------------------
>>
>> Anyway, no big deal, because we don't intend to oversubscribe the
>> processors on real jobs (and the error message itself suggests a
>> workaround to increase np, if needed).
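>>
>> If we ever do need the larger np, those suggestions would presumably
>> translate into something like the following (the exact descriptor
>> limit needed is only a guess):
>>
>>    ulimit -n 65536    # raise the open file descriptor limit first
>>    mpirun -mca opal_set_max_sys_limits 1 -mca btl ^sm -np 256 a.out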
>>
>> Many thanks,
>> Gus Correa
>>
>> Ralph Castain wrote:
>>> I would certainly try it -mca btl ^sm and see if that solves the problem.
>>>
>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>
>>>> Gus Correa wrote:
>>>>
>>>>> Dear Open MPI experts
>>>>>
>>>>> I need your help to get Open MPI right on a standalone
>>>>> machine with Nehalem processors.
>>>>>
>>>>> How to tweak the mca parameters to avoid the problems
>>>>> on Nehalem (and perhaps also on AMD processors),
>>>>> where MPI programs hang, was discussed here before.
>>>>>
>>>>> However, I lost track of the details: how to work around the problem,
>>>>> and whether it has perhaps been fully fixed already.
>>>> Yes, perhaps the problem you're seeing is not what you remember being discussed.
>>>>
>>>> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably fixed.
>>>>
>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>
>>>>> I installed OpenMPI 1.4.1 from source,
>>>>> and compiled the test hello_c.c with mpicc.
>>>>> Then I tried to run it with:
>>>>>
>>>>> 1) mpirun -np 4 a.out
>>>>> It ran OK (but seemed to be slow).
>>>>>
>>>>> 2) mpirun -np 16 a.out
>>>>> It hung, and brought the machine to a halt.
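>>>>>
>>>>> (For reference, hello_c.c is the example from the Open MPI
>>>>> distribution and is essentially the textbook MPI hello world,
>>>>> along these lines:
>>>>>
>>>>>    #include <stdio.h>
>>>>>    #include "mpi.h"
>>>>>
>>>>>    int main(int argc, char* argv[])
>>>>>    {
>>>>>        int rank, size;
>>>>>        MPI_Init(&argc, &argv);
>>>>>        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>        MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>        printf("Hello, world, I am %d of %d\n", rank, size);
>>>>>        MPI_Finalize();
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>> so there is nothing unusual in the test code itself.)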
>>>>>
>>>>> Any words of wisdom are appreciated.
>>>>>
>>>>> More info:
>>>>>
>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>> * OS is Fedora Core 12.
>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>> (I can see 16 "processors".)
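>>>>>
>>>>> (Counted with something like
>>>>>
>>>>>    grep -c ^processor /proc/cpuinfo
>>>>>
>>>>> i.e. 2 sockets x 4 cores x 2 hardware threads = 16 logical CPUs.)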
>>>>>
>>>>> **
>>>>>
>>>>> What should I do?
>>>>>
>>>>> Use -mca btl ^sm ?
>>>>> Use -mca btl_sm_num_fifos some_number ? (Which number? Full command
>>>>> lines are sketched below.)
>>>>> Use both?
>>>>> Do something else?
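>>>>>
>>>>> For concreteness, the first two options would look something like
>>>>> this (the fifo count of 4 is just a placeholder; I don't know what
>>>>> value is appropriate):
>>>>>
>>>>>    mpirun -mca btl ^sm -np 16 a.out
>>>>>    mpirun -mca btl_sm_num_fifos 4 -np 16 a.out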