Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-04 17:18:34


Hi Ralph

Thank you very much.
The "-mca btl ^sm" workaround seems to have solved the problem,
at least for the little hello_c.c test.
I just ran it fine up to 128 processes.
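
For reference, a minimal sketch of the command line this workaround amounts to. The helper name `mpirun_no_sm_cmd` is made up for illustration, and `./a.out` stands for the compiled hello_c.c; the sketch only prints the command, it does not launch anything.

```shell
#!/bin/sh
# Sketch only: print the mpirun command line used for the workaround.
# "./a.out" stands for the compiled hello_c.c; the helper name is hypothetical.

# mpirun_no_sm_cmd NP PROGRAM
# Prints an mpirun command that excludes the shared-memory (sm) BTL.
# The caret means "every BTL except the ones listed".
mpirun_no_sm_cmd() {
    np=$1
    prog=$2
    printf 'mpirun -np %s -mca btl ^sm %s\n' "$np" "$prog"
}

# Show the command for 128 processes, the largest count that ran here.
mpirun_no_sm_cmd 128 ./a.out
```

In shells where `^` is special (some csh and zsh setups), the value may need quoting, as in `-mca btl '^sm'`.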

I confess I am puzzled by this workaround.
* Why should we turn off "sm" in a standalone machine,
where everything is supposed to operate via shared memory?
* Do I incur a performance penalty by not using "sm"?
* What other mechanism is actually used by OpenMPI for process
communication in this case?

It seems to be using tcp, because when I try -np 256 I get this error:

[spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
of network connections a process can open was reached in file
../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can
be open
This can be resolved by setting the mca parameter
opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------

Anyway, no big deal, because we don't intend to oversubscribe the
processors on real jobs (and the error message itself suggests a
workaround to increase np, if needed).
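
A rough sketch of how the two remedies named in the error message could be tried, assuming a Bourne-style shell whose `ulimit` accepts `-n`, `-S` and `-H` (bash and dash do). The helper names are illustrative only, and the lines that change the limit or launch the job are left commented out.

```shell
#!/bin/sh
# Sketch only: inspect the per-process open-descriptor limit that the
# error message refers to. Helper names are hypothetical.

# Print the soft limit on open file descriptors for this shell.
fd_soft_limit() {
    ulimit -S -n
}

# Print the hard limit, the ceiling up to which a non-root user
# can raise the soft limit.
fd_hard_limit() {
    ulimit -H -n
}

echo "soft limit: $(fd_soft_limit)"
echo "hard limit: $(fd_hard_limit)"

# Option 1: raise the soft limit to the hard limit for this shell
# and the processes it starts (uncomment to apply).
# ulimit -S -n "$(fd_hard_limit)"

# Option 2: set the MCA parameter named in the error message
# (uncomment to run; ./a.out is the hello_c.c binary).
# mpirun -mca opal_set_max_sys_limits 1 -np 256 ./a.out
```

If the hard limit itself is too low, only the system administrator can raise it, as the error message says.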

Many thanks,
Gus Correa

Ralph Castain wrote:
> I would certainly try -mca btl ^sm and see if that solves the problem.
>
> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>
>> Gus Correa wrote:
>>
>>> Dear Open MPI experts
>>>
>>> I need your help to get Open MPI right on a standalone
>>> machine with Nehalem processors.
>>>
>>> How to tweak the mca parameters to avoid problems
>>> with Nehalem (and perhaps AMD processors also),
>>> where MPI programs hang, was discussed here before.
>>>
>>> However, I lost track of the details: how to work around the problem,
>>> and whether it has been fully fixed already.
>> Yes, perhaps the problem you're seeing is not what you remember being discussed.
>>
>> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably fixed.
>>
>>> I am now facing the problem directly on a single Nehalem box.
>>>
>>> I installed OpenMPI 1.4.1 from source,
>>> and compiled the test hello_c.c with mpicc.
>>> Then I tried to run it with:
>>>
>>> 1) mpirun -np 4 a.out
>>> It ran OK (but seemed to be slow).
>>>
>>> 2) mpirun -np 16 a.out
>>> It hung, and brought the machine to a halt.
>>>
>>> Any words of wisdom are appreciated.
>>>
>>> More info:
>>>
>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>> * OS is Fedora Core 12.
>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>> processors on a two-way motherboard and 48GB of RAM.
>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>> (I can see 16 "processors".)
>>>
>>> **
>>>
>>> What should I do?
>>>
>>> Use -mca btl ^sm ?
>>> Use -mca btl_sm_num_fifos some_number ? (Which number?)
>>> Use Both?
>>> Do something else?
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users