Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-04 17:18:34

Hi Ralph

Thank you very much.
The "-mca btl ^sm" workaround seems to have solved the problem,
at least for the little hello_c.c test.
I just ran it fine up to 128 processes.

I confess I am puzzled by this workaround.
* Why should we turn off "sm" in a standalone machine,
where everything is supposed to operate via shared memory?
* Do I incur a performance penalty by not using "sm"?
* What other mechanism is actually used by OpenMPI for process
communication in this case?

It seems to be using tcp, because when I try -np 256 I get this error:

[spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
of network connections a process can open was reached in file
../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
Error: system limit exceeded on number of network connections that can
be open
This can be resolved by setting the mca parameter
opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.

No big deal, though, because we don't intend to oversubscribe the
processors on real jobs anyway (and the error message itself suggests a
workaround to increase np, if needed).

Many thanks,
Gus Correa

Ralph Castain wrote:
> I would certainly try -mca btl ^sm and see if that solves the problem.
> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>> Gus Correa wrote:
>>> Dear Open MPI experts
>>> I need your help to get Open MPI right on a standalone
>>> machine with Nehalem processors.
>>> How to tweak the mca parameters to avoid problems
>>> with Nehalem (and perhaps AMD processors also),
>>> where MPI programs hang, was discussed here before.
>>> However, I lost track of the details: how to work around the problem,
>>> and whether it has already been fully fixed.
>> Yes, perhaps the problem you're seeing is not what you remember being discussed.
>> Perhaps you're thinking of . It's presumably fixed.
>>> I am now facing the problem directly on a single Nehalem box.
>>> I installed OpenMPI 1.4.1 from source,
>>> and compiled the test hello_c.c with mpicc.
>>> Then I tried to run it with:
>>> 1) mpirun -np 4 a.out
>>> It ran OK (but seemed to be slow).
>>> 2) mpirun -np 16 a.out
>>> It hung, and brought the machine to a halt.
>>> Any words of wisdom are appreciated.
>>> More info:
>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>> * OS is Fedora Core 12.
>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>> processors on a two-way motherboard and 48GB of RAM.
>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>> (I can see 16 "processors".)
>>> **
>>> What should I do?
>>> Use -mca btl ^sm ?
>>> Use -mca btl_sm_num_fifos some_number ? (Which number?)
>>> Use Both?
>>> Do something else?
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]