
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Gus Correa (gus_at_[hidden])
Date: 2010-05-06 20:36:16


Hi Jeff

Answers inline.

Jeff Squyres wrote:
> On May 6, 2010, at 2:01 PM, Gus Correa wrote:
>
>> 1) Now I can see and use the btl_sm_num_fifos component:
>>
>> I had committed already "btl = ^sm" to the openmpi-mca-params.conf
>> file. This apparently hides the btl_sm_num_fifos from ompi_info.
>>
>> After I switched to no options in openmpi-mca-params.conf,
>> then ompi_info showed the btl_sm_num_fifos component.
>>
>> ompi_info --all | grep btl_sm_num_fifos
>> MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value)
>>
>> A side comment:
>> This means that the system administrator can
>> hide some Open MPI options from the users, depending on what
>> he puts in the openmpi-mca-params.conf file, right?
>
> Correct.
>
> BUT: a user can always override the "btl" MCA param and see them again. For example, you could also have done this:
>
> echo "btl =" > ~/.openmpi/mca-params.conf
> ompi_info --all | grep btl_sm_num_fifos
> # ...will show the sm params...
>

Aha!
Can they override my settings?!
Can't anymore.
I'm gonna write a BOFH cron script to run every 10 minutes,
check for and delete any ~/.openmpi directory,
shut down the recalcitrant account, make a tarball of its ~,
and send it to the mass store. Quarantined. :)

>> 2) However, running with "sm" still breaks, unfortunately:
>>
>> Boomer!
>
> Doh!
>
>> I get the same errors that I reported in my very
>> first email, if I increase the number of processes to 16,
>> to explore the hyperthreading range.
>>
>> This is using "sm" (i.e. not excluded in the mca config file),
>> and btl_sm_num_fifos (mpiexec command line)
>>
>> The machine hangs, requires a hard reboot, etc, etc,
>> as reported earlier. See the below, please.
>
> I saw that only some probably-unrelated dmesg messages were emitted. Was there anything else revealing on the console and/or /var/log/* files? Hard reboots absolutely should not be caused by Open MPI.
>

I don't think the problem is with Open MPI.
So, it may not be easy to find a logical link between the kernel
messages and the MPI hello_c that was running.

>> So, I guess the conclusion is that I can use sm,
>> but I have to remain within the range of physical cores (8),
>> not oversubscribe, not try to explore the HT range.
>> Should I expect it to work also for np>number of physical cores?
>
> Your prior explanations of when HT is useful
> seemed pretty reasonable to me.
> Meaning: Nehalem HT will help only in some kinds of codes.
> Dense computation codes with few conditional branches may
> not benefit much from HT.
>

When there aren't frequent requests to change the code
to include new features, one can think about optimizing for
dense computation, avoiding inner-loop branches, etc.
That is the situation reported by Doug Reeder on this thread,
where his optimized finite element code runs at roughly 2/3
of its normal speed when HT is used.

However, most of the codes we run here seem to have been optimized at
some point in their early life, but have since accreted so many new
features that the if/elseif/elseif... branches are abundant.
The logic can get so complicated to untangle and streamline that
nobody dares to rewrite the code, afraid of producing wrong results,
or of facing a long code re-development cycle (without support).
It is like fixing the plumbing or wiring of an old house.
OO that goes OOverboard also plays a role, often misses the
point, and can add more overhead.
I would guess this situation is not specific to
Earth Science applications (which tend to be big and complex).

So, chances are that hyperthreading may give us a little edge,
harnessing the code's imperfections.
Not a big edge, maybe 10-20%, I would guess.
I experienced that type of speedup with SMT/HT on an IBM machine
with one of these big codes.

> But OMPI applications should always run *correctly*,
> regardless of HT or not-HT -- even if you're oversubscribing.
> The performance may suffer (sometimes dramatically)
> if you oversubscribe physical cores with dense computational code,
> but it should always run *correctly*.
>

That is what I was seeking in the first place:
not performance with HT, but correctness with HT.

Whether we would use HT or not was to be decided later,
after testing how the atmospheric model would perform
with and without HT.
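Before timing anything, it helps to confirm how many of the CPUs the
kernel shows are real cores versus HT siblings. A minimal sketch,
assuming Linux and the usual /proc/cpuinfo layout (nothing here is
specific to spinoza):

```shell
# Count logical CPUs vs. physical cores on Linux.
# Each physical core is a unique (physical id, core id) pair in
# /proc/cpuinfo; with HT enabled, logical should be twice physical.
logical=$(grep -c '^processor' /proc/cpuinfo)
physical=$(grep -E '^(physical id|core id)' /proc/cpuinfo | paste - - | sort -u | wc -l)
echo "logical CPUs:   $logical"
echo "physical cores: $physical"
```

On the Nehalem box above, this should report 16 logical CPUs and 8
physical cores, which tells you where the "HT range" of np values begins.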

>> I wonder if this would still work with np<=8, but with heavier code.
>> (I only used hello_c.c so far.)
>
> If hello_c is crashing your computer -
> even if you're running np>8 or np>16 --
> something is wrong outside of Open MPI.
> I routinely run np=100 hello_c on machines.
>

I've got hello_c to run correctly with heavy oversubscription on
our cluster nodes (up to np=1024 on an 8-core node, IIRC).
Heavier programs don't go that far, but still run with light
oversubscription.
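For anyone who wants to repeat that kind of test, here is a minimal
sketch; the binary name hello_c, the function name, and the np values
are just examples, and it assumes mpiexec is in the PATH and hello_c
was built from the Open MPI examples directory:

```shell
# Oversubscription smoke test: run a trivial MPI program at increasing
# process counts on one node, reporting the first count that fails.
smoke_test() {
    for np in 8 16 32 64 128 256 512 1024; do
        if mpiexec -np "$np" ./hello_c >/dev/null 2>&1; then
            echo "np=$np: ok"
        else
            echo "np=$np: FAILED"
            return 1
        fi
    done
}
# smoke_test   # uncomment on a machine with Open MPI installed
```

On a healthy node this should walk all the way up; on the machine
above it should stop well short of np=16 rather than hang the box.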

But on that Nehalem + Fedora 12 machine it doesn't work.
So, the evidence is clear.
The problem is not with Open MPI.

>> $ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
>> --------------------------------------------------------------------------
>> mpiexec noticed that process rank 8 with PID 3659 on node spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> $
>>
>> Message from syslogd_at_spinoza at May 6 13:38:13 ...
>> kernel:------------[ cut here ]------------
>>
>> Message from syslogd_at_spinoza at May 6 13:38:13 ...
>> kernel:invalid opcode: 0000 [#1] SMP
>>
>> Message from syslogd_at_spinoza at May 6 13:38:13 ...
>> kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id
>>
>> Message from syslogd_at_spinoza at May 6 13:38:13 ...
>> kernel:Stack:
>>
>> Message from syslogd_at_spinoza at May 6 13:38:13 ...
>> kernel:Call Trace:
>>
>> Message from syslogd_at_spinoza at May 6 13:38:13 ...
>> kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01
>
> I unfortunately don't know what these messages mean...
>

I think the last one is hex for Dante Alighieri's Inferno:

"Lasciate ogni speranza, voi ch'entrate"
("Abandon all hope, ye who enter here")

Cheers,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------