Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-05 13:30:34


Hi Jeff, Ralph, list.

Sorry for the long email and the delay in answering.
I had to test MPI and reboot the machine several times
to address the questions.
Hopefully all your questions are answered inline below.

Jeff Squyres wrote:
> I'd actually be a little surprised if HT was the problem.
> I run with HT enabled on my nehalem boxen all the time.
> It's pretty surprising that Open MPI is causing a hard lockup
> of your system; user-level processes shouldn't be able to do that.
>

I hope I can do the same here! :)

> Notes:
>
> 1. With HT enabled, as you noted, Linux will just see
> 2x as many cores as you actually have.
> Depending on your desired workload,
> this may or may not help you.
> But that shouldn't affect the correctness of running your
> MPI application.
>

I agree and that is what I seek.
Correctness first, performance later.
I want OpenMPI to work correctly, with or without hyperthreading,
and preferably using the "sm" BTL.
In order, let's see what is possible, what works, what performs better.

***

Here are the most recent experiments with v1.4.2:
1) with hyperthreading (HT) turned ON,
2) then with HT turned OFF in the BIOS.

In both cases I tried
A) "-mca btl ^sm" and
B) without it.
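
For anyone who wants to repeat the same sweep, here is a minimal sketch that only builds the command lines for cases A and B. It assumes Python 3.8 or later; the helper names (np_values, build_command, all_cases) are hypothetical, and the mpiexec path and program name are simply the ones used in this message.

```python
import shlex

# Path and program as used in this message; adjust for another install.
MPIEXEC = "/opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec"
PROGRAM = "a.out"


def np_values(start=4, stop=256):
    """Process counts doubling from start up to stop: 4, 8, ..., 256."""
    values = []
    n = start
    while n <= stop:
        values.append(n)
        n *= 2
    return values


def build_command(np, disable_sm):
    """Return the mpiexec argument list for one test case."""
    cmd = [MPIEXEC]
    if disable_sm:
        # Case A: exclude the "sm" BTL.
        cmd += ["-mca", "btl", "^sm"]
    cmd += ["-np", str(np), PROGRAM]
    return cmd


def all_cases():
    """Case A (sm excluded) for every np, then case B (defaults)."""
    return [build_command(n, disable_sm)
            for disable_sm in (True, False)
            for n in np_values()]


if __name__ == "__main__":
    # Print only; running them is left to the user (case B may hang the box).
    for cmd in all_cases():
        print(shlex.join(cmd))
```

The script deliberately prints the commands instead of running them, since on this machine the default ("sm" on) case can take the whole system down.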

(Just in case, I checked and /proc/cpuinfo reports a number of cores
consistent with the BIOS setting for HT.)
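
A minimal sketch of that check, assuming an x86 Linux /proc/cpuinfo with "processor", "physical id" and "core id" fields. The function name count_cpus is hypothetical. With HT on, the logical count should be twice the core count; with HT off the two should match.

```python
import os


def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo text."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    # The extra "" makes sure the last stanza is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one processor stanza.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return logical, len(cores)


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        logical, cores = count_cpus(f.read())
    print(f"logical CPUs: {logical}, physical cores: {cores}")
```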

Details below, but first off,
my conclusion is that HT OFF or ON makes *NO difference*.
The problem seems to be with the "sm" btl.
When "sm" is on (default) OpenMPI breaks (at least on this computer).

################################
1) With hyperthreading turned ON:
################################

A) with -mca btl ^sm (i.e. "sm" OFF):
Ran fine with 4,8,...,128 processes and fails with 256,
due to system limit on the number of open TCP connections,
as reported before with 1.4.1.

B) withOUT any -mca parameters (i.e. "sm" ON):
Ran fine with 4,...,32, but failed with 64 processes,
with the same segfault and syslog error messages I reported
before for both 1.4.1 and 1.4.2.
(see below)

Of course np=64 is oversubscribing, but this is just a "hello world"
lightweight test.
Moreover, in the previous experiments with both 1.4.1 and 1.4.2
the failures happened even earlier, with np = 16, which is
exactly the number of (virtual) processors with hyperthreading turned on,
i.e., with no oversubscription.

The machine returns the prompt, but hangs right after.

Could the failures be traced to some funny glitch in the
Fedora Core 12 (2.6.32.11-99.fc12.x86_64) SMP kernel?

[gus_at_spinoza ~]$ uname -a
Linux spinoza.ldeo.columbia.edu 2.6.32.11-99.fc12.x86_64 #1 SMP Mon Apr 5 19:59:38 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

********
ERROR messages:

  /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out

Message from syslogd_at_spinoza at May 4 22:28:15 ...
  kernel:------------[ cut here ]------------

Message from syslogd_at_spinoza at May 4 22:28:15 ...
  kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd_at_spinoza at May 4 22:28:15 ...
  kernel:last sysfs file:
/sys/devices/system/cpu/cpu15/topology/physical_package_id

Message from syslogd_at_spinoza at May 4 22:28:15 ...
  kernel:Stack:
--------------------------------------------------------------------------
mpiexec noticed that process rank 63 with PID 6587 on node
spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Message from syslogd_at_spinoza at May 4 22:28:15 ...
  kernel:Call Trace:

Message from syslogd_at_spinoza at May 4 22:28:15 ...
  kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00
4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04
<0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01

************
################################
2) Now with hyperthreading OFF:
################################

A) with -mca btl ^sm (i.e. "sm" OFF):
Ran fine with 4,8,...,128 processes and fails with 256,
due to system limit on the number of open TCP connections,
as reported before with 1.4.1.
This is exactly the same result as with HT ON.

B) withOUT any -mca parameters (i.e. "sm" ON):
Ran fine with 4,...,32, but failed with 64 processes,
with the same syslog messages, but hung before showing
the Open MPI segfault message (see below).
So, again, the behavior is very similar to that with HT ON.

-------------------------------------------------------
My conclusion is that HT OFF or ON makes NO difference.
The problem seems to be with the "sm" btl.
-------------------------------------------------------

***********
ERROR MESSAGES

[root_at_spinoza examples]# /opt/sw/openmpi/1.4.2/gnu-4.4.3-4/bin/mpiexec -np 64 a.out

Message from syslogd_at_spinoza at May 5 12:04:05 ...
  kernel:------------[ cut here ]------------

Message from syslogd_at_spinoza at May 5 12:04:05 ...
  kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd_at_spinoza at May 5 12:04:05 ...
  kernel:last sysfs file:
/sys/devices/system/cpu/cpu7/topology/physical_package_id

Message from syslogd_at_spinoza at May 5 12:04:05 ...
  kernel:Stack:

Message from syslogd_at_spinoza at May 5 12:04:05 ...
  kernel:Call Trace:

***********
> 2. To confirm: yes, TCP will be quite a bit slower than sm
> (but again, that depends on how much MPI traffic you're sending).
>

Thank you, the clarification is really important.
I suppose then that "sm" is preferred, if I can get it to work right.

The main goal is to run yet another atmospheric model on this machine.
It is a typical domain decomposition problem,
with a bunch of 2D arrays being exchanged
across domain boundaries at each time step.
This is the MPI traffic.
There are probably some collectives too,
but I haven't checked out the code.
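
To get a feel for how much traffic that is, here is a rough estimate, not taken from the model itself. It assumes a regular 2D horizontal decomposition where each boundary array is an edge times the number of vertical levels; the function name halo_bytes_per_step and all the grid numbers are hypothetical.

```python
def halo_bytes_per_step(nx, ny, px, py, levels, fields,
                        bytes_per_value=8, halo_width=1):
    """Bytes sent per time step by one interior subdomain.

    nx, ny   : global horizontal grid size (hypothetical)
    px, py   : number of subdomains in each direction
    levels   : vertical levels, so each boundary is a 2D array
    fields   : number of arrays exchanged per step
    Corner exchanges are ignored to keep the estimate simple.
    """
    local_nx = nx // px
    local_ny = ny // py
    # Two edges of length local_nx and two of length local_ny.
    edge_cells = 2 * halo_width * (local_nx + local_ny)
    return edge_cells * levels * fields * bytes_per_value


if __name__ == "__main__":
    # Made-up numbers, only to illustrate the arithmetic.
    total = halo_bytes_per_step(nx=256, ny=128, px=4, py=2,
                                levels=20, fields=5)
    print(f"{total} bytes per subdomain per step")
```

If the real numbers turn out to be small, the gap between "sm" and TCP may matter less than the latency of each exchange.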

> 3. Yes, you can disable the 2nd thread on each core via Linux,
> but you need root-level access to do it.
>

I have root-level access.
However, so far I only learned the BIOS way, which requires a reboot.

Doing it in Linux would be more convenient, avoiding reboots,
I suppose.
How do I do it in Linux?
Should I overwrite something in /proc?
Something else?
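
For reference, one approach that should avoid a reboot goes through /sys rather than /proc: as root, writing 0 to /sys/devices/system/cpu/cpuN/online takes logical CPU N offline, and topology/thread_siblings_list under each cpuN directory shows which logical CPUs share a core. The sketch below is a dry run that only prints the commands. The helper names are hypothetical, and it assumes the lowest-numbered thread of each core is the one to keep.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,8' or '0-3,8-11' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def secondary_threads(siblings_by_cpu):
    """Given {cpu: sibling-list text}, return every CPU that is not the
    lowest-numbered thread of its core."""
    secondary = set()
    for text in siblings_by_cpu.values():
        secondary.update(parse_cpu_list(text)[1:])
    return sorted(secondary)


def read_siblings(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every online CPU under root."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology",
                           "thread_siblings_list")
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)", path[len(root):]).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


if __name__ == "__main__":
    # Dry run: print the commands instead of writing to /sys.
    for cpu in secondary_threads(read_siblings()):
        print(f"echo 0 > /sys/devices/system/cpu/cpu{cpu}/online")
```

Writing 1 to the same file should bring the CPU back, so this can be toggled between test runs.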

> Some questions:
>
> - is the /tmp directory on your local disk?

Yes.
And there is plenty of room in the / filesystem and the
/tmp directory:

[root_at_spinoza ~]# ll -d /tmp
drwxrwxrwt 22 root root 4096 2010-05-05 12:36 /tmp

[root_at_spinoza ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_spinoza-lv_root
                       1.8T 504G 1.2T 30% /
tmpfs 24G 0 24G 0% /dev/shm
/dev/sda1 194M 40M 144M 22% /boot

FYI, this is a standalone workstation.
MPI is not being used over any network, private or local.
It is all "inside the box".

> - are there any revealing messages in
> /var/log/messages (or equivalent)
> about failures when the machine hangs?
>

Parsing kernel messages is neither my favorite hobby nor my league.
In any case, as far as my search could go, there are just standard
kernel messages in /var/log/messages (e.g. ntpd synchronization, etc.),
until the system hangs when the hello_c program fails.
Then the log starts again with the boot process.
This behavior was repeated time and again over my several
attempts to run OpenMPI programs with the "sm" btl on.

***

However, I am suspicious of these kernel messages during boot.
Are they telling me of a memory misconfiguration, perhaps?
What do the "*BAD*gran_size: ..." lines mean?

Does anybody out there with a sane, functional Nehalem system
get these funny "*BAD*gran_size: ..." lines
in the "dmesg | more" output, or in /var/log/messages during boot?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
total RAM covered: 49144M
  gran_size: 64K chunk_size: 64K num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 128K num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 256K num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 512K num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 1M num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 2M num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 4M num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 8M num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 16M num_reg: 8 lose cover RAM: 0G
  gran_size: 64K chunk_size: 32M num_reg: 8 lose cover RAM: 0G
  gran_size: 64K chunk_size: 64M num_reg: 8 lose cover RAM: 0G
  gran_size: 64K chunk_size: 128M num_reg: 8 lose cover RAM: 0G
  gran_size: 64K chunk_size: 256M num_reg: 8 lose cover RAM: 0G
  gran_size: 64K chunk_size: 512M num_reg: 8 lose cover RAM: 0G
  gran_size: 64K chunk_size: 1G num_reg: 8 lose cover RAM: 0G
*BAD*gran_size: 64K chunk_size: 2G num_reg: 8 lose cover RAM: -1G
  gran_size: 128K chunk_size: 128K num_reg: 8 lose cover RAM: 45G
  gran_size: 128K chunk_size: 256K num_reg: 8 lose cover RAM: 45G
  gran_size: 128K chunk_size: 512K num_reg: 8 lose cover RAM: 45G
  gran_size: 128K chunk_size: 1M num_reg: 8 lose cover RAM: 45G
  gran_size: 128K chunk_size: 2M num_reg: 8 lose cover RAM: 45G
  gran_size: 128K chunk_size: 4M num_reg: 8 lose cover RAM: 45G
  gran_size: 128K chunk_size: 8M num_reg: 8 lose cover RAM: 45G
  gran_size: 128K chunk_size: 16M num_reg: 8 lose cover RAM: 0G
  gran_size: 128K chunk_size: 32M num_reg: 8 lose cover RAM: 0G
  gran_size: 128K chunk_size: 64M num_reg: 8 lose cover RAM: 0G
  gran_size: 128K chunk_size: 128M num_reg: 8 lose cover RAM: 0G
  gran_size: 128K chunk_size: 256M num_reg: 8 lose cover RAM: 0G
  gran_size: 128K chunk_size: 512M num_reg: 8 lose cover RAM: 0G
  gran_size: 128K chunk_size: 1G num_reg: 8 lose cover RAM: 0G
*BAD*gran_size: 128K chunk_size: 2G num_reg: 8 lose cover RAM: -1G

... and it goes on and on ... then stops with

*BAD*gran_size: 512M chunk_size: 2G num_reg: 8 lose cover RAM: -520M
  gran_size: 1G chunk_size: 1G num_reg: 6 lose cover RAM: 1016M
  gran_size: 1G chunk_size: 2G num_reg: 7 lose cover RAM: 1016M
  gran_size: 2G chunk_size: 2G num_reg: 5 lose cover RAM: 2040M
mtrr_cleanup: can not find optimal value
please specify mtrr_gran_size/mtrr_chunk_size

...

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
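
Since the kernel asks for mtrr_gran_size/mtrr_chunk_size to be specified, a small parser can help sift that long table. This is only a sketch: it assumes the line format shown above, the function names are hypothetical, and it merely lists the combinations that are not flagged *BAD* and report no lost RAM, without claiming any of them is the right one for this machine.

```python
import re

# Matches lines like:
#   gran_size: 64K chunk_size: 16M num_reg: 8 lose cover RAM: 0G
# *BAD*gran_size: 64K chunk_size: 2G num_reg: 8 lose cover RAM: -1G
LINE = re.compile(
    r"(?P<bad>\*BAD\*)?\s*gran_size:\s*(?P<gran>\S+)"
    r"\s+chunk_size:\s*(?P<chunk>\S+)"
    r"\s+num_reg:\s*(?P<num_reg>\d+)"
    r"\s+lose cover RAM:\s*(?P<lose>-?\d+[KMG])"
)


def parse_mtrr_lines(text):
    """Turn the mtrr_cleanup table into a list of dicts."""
    entries = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            entries.append({
                "bad": m.group("bad") is not None,
                "gran": m.group("gran"),
                "chunk": m.group("chunk"),
                "num_reg": int(m.group("num_reg")),
                "lose": m.group("lose"),
            })
    return entries


def lossless(entries):
    """Entries not flagged *BAD* whose 'lose cover RAM' value is zero."""
    return [e for e in entries
            if not e["bad"] and int(e["lose"][:-1]) == 0]


if __name__ == "__main__":
    sample = """\
  gran_size: 64K chunk_size: 8M num_reg: 8 lose cover RAM: 45G
  gran_size: 64K chunk_size: 16M num_reg: 8 lose cover RAM: 0G
*BAD*gran_size: 64K chunk_size: 2G num_reg: 8 lose cover RAM: -1G
"""
    for e in lossless(parse_mtrr_lines(sample)):
        print(f"mtrr_gran_size={e['gran']} mtrr_chunk_size={e['chunk']}")
```

On a real system the input would be the saved dmesg output rather than the three sample lines.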

I know about the finicky memory configuration details
required by Nehalem, but I didn't put this system together,
nor have I opened the box yet to see what is inside.

Kernel experts and Nehalem Pros:

If something sounds suspicious, please tell me, and I will
check if the memory modules are the right ones and correctly
distributed on the slots.

**

Thank you very much,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

>
>
> On May 4, 2010, at 8:35 PM, Gus Correa wrote:
>
>> Hi Douglas
>>
>> Yes, very helpful indeed!
>>
>> The machine here is a two-way quad-core, and /proc/cpuinfo shows 16
>> processors, twice as much as the physical cores,
>> just like you see on yours.
>> So, HT is turned on for sure.
>>
>> The security guard opened the office door for me,
>> and I could reboot that machine.
>> It's called Spinoza. Maybe that's why it is locked.
>> Now the door is locked again, so I will have to wait until tomorrow
>> to play around with the BIOS settings.
>>
>> I will remember the BIOS double negative that you pointed out:
>> "When Disabled only one thread per core is enabled"
>> Ain't that English funny?
>> So far, I can't get no satisfaction.
>> Hence, let's see if Ralph's suggestion works.
>> Never get no hyperthreading turned on,
>> and you ain't have no problems with Open MPI. :)
>>
>> Many thanks!
>> Have a great Halifax Spring time!
>>
>> Cheers,
>> Gus
>>
>> Douglas Guptill wrote:
>>> On Tue, May 04, 2010 at 05:34:40PM -0600, Ralph Castain wrote:
>>>> On May 4, 2010, at 4:51 PM, Gus Correa wrote:
>>>>
>>>>> Hi Ralph
>>>>>
>>>>> Ralph Castain wrote:
>>>>>> One possibility is that the sm btl might not like that you have hyperthreading enabled.
>>>>> I remember that hyperthreading was discussed months ago,
>>>>> in the previous incarnation of this problem/thread/discussion on "Nehalem vs. Open MPI".
>>>>> (It sounds like one of those supreme court cases ... )
>>>>>
>>>>> I don't really administer that machine,
>>>>> or any machine with hyperthreading,
>>>>> so I am not much familiar to the HT nitty-gritty.
>>>>> How do I turn off hyperthreading?
>>>>> Is it a BIOS or a Linux thing?
>>>>> I may try that.
>>>> I believe it can be turned off via an admin-level cmd, but I'm not certain about it
>>> The challenge was too great to resist, so I yielded, and rebooted my
>>> Nehalem (Core i7 920 @ 2.67 GHz) to confirm my thoughts on the issue.
>>>
>>> Entering the BIOS setup by pressing "DEL", and "right-arrowing" over
>>> to "Advanced", then "down arrow" to "CPU configuration", I found a
>>> setting called "Intel (R) HT Technology". The help dialogue says
>>> "When Disabled only one thread per core is enabled".
>>>
>>> Mine is "Enabled", and I see 8 cpus. The Core i7, to my
>>> understanding, is a 4 core chip.
>>>
>>> Hope that helps,
>>> Douglas.