Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-04 18:29:01


Hi Jeff

Sorry, same problem with v1.4.2.
Without any mca parameters set (i.e. withOUT -mca btl ^sm),
hello_c.c runs OK for np = 4 and 8.
(Though it runs more slowly than 1.4.1 did with "sm" turned off,
as Ralph suggested an hour ago.)

Nevertheless, when I try np=16 it segfaults,
with the syslog messages below.
After that the machine goes south:
I can ping it, but I can't ssh to it.
This was the same behavior I saw and reported when using 1.4.1.

I can't run anything else today, because the machine hung again,
needs a hard reboot, and is locked in an office that
I don't have the keys to. :)

Anyway, I can live with the "-mca btl ^sm" workaround.
That way it seems to use tcp, right?
I hope the impact on performance is not too large.
I will try that on 1.4.2 tomorrow, if somebody opens that office.
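
For the record, here is a quick sketch of how I plan to apply and double-check
the workaround. The install prefix is just my local one, and I believe the
per-user MCA parameter file and the ompi_info listing below are standard
Open MPI, though the exact output may differ by version:

# disable the shared-memory BTL for a single run
/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpirun -mca btl ^sm -np 16 a.out

# or make it the default for my user; mpirun reads this file at startup
echo "btl = ^sm" >> ~/.openmpi/mca-params.conf

# list the BTL components this build knows about (sm, tcp, self, ...)
ompi_info | grep "MCA btl"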

Regards,
Gus
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpirun -np 4 a.out

Hello, world, I am 0 of 4
Hello, world, I am 1 of 4
Hello, world, I am 2 of 4
Hello, world, I am 3 of 4

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpirun -np 8 a.out

Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 5 of 8
Hello, world, I am 6 of 8
Hello, world, I am 7 of 8

/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpirun -np 16 a.out

--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 14716 on node
spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Message from syslogd_at_spinoza at May 4 18:02:56 ...
  kernel:------------[ cut here ]------------

Message from syslogd_at_spinoza at May 4 18:02:56 ...
  kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd_at_spinoza at May 4 18:02:56 ...
  kernel:last sysfs file:
/sys/devices/system/cpu/cpu15/topology/physical_package_id

Message from syslogd_at_spinoza at May 4 18:02:56 ...
  kernel:Stack:

Message from syslogd_at_spinoza at May 4 18:02:56 ...
  kernel:Call Trace:

Message from syslogd_at_spinoza at May 4 18:02:56 ...
  kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00
4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04
<0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01

Gus Correa wrote:
> Hi Jeff
>
> Sure, I will certainly try v1.4.2.
> I am downloading it right now.
> As of this morning, when I first downloaded,
> the web site still had 1.4.1.
> Maybe I should have refreshed the web page on my browser.
>
> I will tell you how it goes.
>
> Gus
>
> Jeff Squyres wrote:
>> Gus -- Can you try v1.4.2 which was just released today?
>>
>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>
>>> Hi Ralph
>>>
>>> Thank you very much.
>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>> at least for the little hello_c.c test.
>>> I just ran it fine up to 128 processes.
>>>
>>> I confess I am puzzled by this workaround.
>>> * Why should we turn off "sm" on a standalone machine,
>>> where everything is supposed to operate via shared memory?
>>> * Do I incur a performance penalty by not using "sm"?
>>> * What other mechanism is actually used by OpenMPI for process
>>> communication in this case?
>>>
>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>>
>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>> of network connections a process can open was reached in file
>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>> --------------------------------------------------------------------------
>>>
>>> Error: system limit exceeded on number of network connections that can
>>> be open
>>> This can be resolved by setting the mca parameter
>>> opal_set_max_sys_limits to 1,
>>> increasing your limit descriptor setting (using limit or ulimit
>>> commands),
>>> or asking the system administrator to increase the system limit.
>>> --------------------------------------------------------------------------
>>>
>>>
>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>> processors on real jobs (and the error message itself suggests a
>>> workaround to increase np, if needed).
>>>
>>> Many thanks,
>>> Gus Correa
>>>
>>> Ralph Castain wrote:
>>>> I would certainly try it with -mca btl ^sm and see if that solves the
>>>> problem.
>>>>
>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>
>>>>> Gus Correa wrote:
>>>>>
>>>>>> Dear Open MPI experts
>>>>>>
>>>>>> I need your help to get Open MPI right on a standalone
>>>>>> machine with Nehalem processors.
>>>>>>
>>>>>> How to tweak the mca parameters to avoid problems
>>>>>> with Nehalem (and perhaps AMD processors also),
>>>>>> where MPI programs hang, was discussed here before.
>>>>>>
>>>>>> However, I lost track of the details: how to work around the problem,
>>>>>> and whether it has been fully fixed already.
>>>>> Yes, perhaps the problem you're seeing is not what you remember
>>>>> being discussed.
>>>>>
>>>>> Perhaps you're thinking of
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably
>>>>> fixed.
>>>>>
>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>
>>>>>> I installed OpenMPI 1.4.1 from source,
>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>> Then I tried to run it with:
>>>>>>
>>>>>> 1) mpirun -np 4 a.out
>>>>>> It ran OK (but seemed to be slow).
>>>>>>
>>>>>> 2) mpirun -np 16 a.out
>>>>>> It hung, and brought the machine to a halt.
>>>>>>
>>>>>> Any words of wisdom are appreciated.
>>>>>>
>>>>>> More info:
>>>>>>
>>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>> * OS is Fedora Core 12.
>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>> (I can see 16 "processors".)
>>>>>>
>>>>>> **
>>>>>>
>>>>>> What should I do?
>>>>>>
>>>>>> Use -mca btl ^sm ?
>>>>>> Use -mca btl_sm_num_fifos=some_number ? (Which number?)
>>>>>> Use both?
>>>>>> Do something else?