Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Execution in multicore machines
From: Torje Henriksen (torjeh_at_[hidden])
Date: 2008-09-30 13:34:31


If they are 8-core Intel machines, I believe this is the case:

*) Each pair of cores shares an L2 cache, so using two cores that
share a cache will probably reduce performance.
*) Each quad-core CPU has its own memory bus (dual independent bus),
so using more than one core on a quad-core CPU can reduce performance
if the bus is a bottleneck.

In your first case, both the L2 cache and the memory bus are shared.
In your second case, only the memory bus is shared. In your third case,
neither the L2 cache nor the memory bus is shared (the Linux scheduler
maps processes so that they run on different CPUs if possible).
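
To check which core IDs actually share an L2 cache on a given node, something like the following might help. It is only a sketch: it assumes a Linux kernel that exposes cpuN/cache/indexM/level and shared_cpu_list under /sys/devices/system/cpu (older kernels may only provide shared_cpu_map), and the function names are made up for this example.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,4" or "0-3,8" into sorted core IDs."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def l2_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct groups of cores that share an L2 cache.

    Assumption: the sysfs layout described above. Where it is missing,
    the result is simply an empty list.
    """
    groups = set()
    for level_file in Path(sysfs_root).glob("cpu[0-9]*/cache/index*/level"):
        if level_file.read_text().strip() != "2":
            continue  # not an L2 cache entry
        shared_file = level_file.with_name("shared_cpu_list")
        if not shared_file.exists():
            continue  # kernel does not expose the list form
        groups.add(tuple(parse_cpu_list(shared_file.read_text())))
    return sorted(groups)


if __name__ == "__main__":
    for group in l2_groups():
        print("L2 shared by cores:", group)
```

On a machine like the one described here, one would expect pairs of core IDs to be printed, but the exact numbering depends on the board and the kernel.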

If you want another performance case, you can map the processes such
that they run on 4 different nodes but share an L2 cache. This can be
done with something like: mpirun -n 8 taskset -c 0,4 LU.C.8. Core
IDs 0 and 4 share an L2 cache, on our system at least. I guess you are
not that interested, but it is possible! :)
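
Note that taskset with a two-core list lets each process float between both cores. If you wanted each process pinned to exactly one core, a small wrapper could do it. The sketch below is hypothetical: the script name pin_one.py is invented, os.sched_setaffinity is Linux-only, and it assumes the launcher exports a per-node rank in OMPI_COMM_WORLD_LOCAL_RANK (recent Open MPI releases do, as far as I know; otherwise it falls back to 0).

```python
import os
import sys


def pick_core(local_rank, cores):
    """Choose one core for this process, cycling through the allowed list."""
    return cores[local_rank % len(cores)]


def main(argv):
    # Hypothetical usage: mpirun -n 8 python pin_one.py 0,4 ./lu.C.8
    cores = [int(c) for c in argv[1].split(",")]
    # Assumption: the launcher sets a per-node rank in the environment.
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    core = pick_core(local_rank, cores)
    os.sched_setaffinity(0, {core})  # 0 means "the calling process"
    os.execvp(argv[2], argv[2:])     # replace the wrapper with the benchmark


if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv)
```

With two processes per node and the list 0,4, local rank 0 would land on core 0 and local rank 1 on core 4.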

In addition, I don't believe there is very much communication happening
in the LU benchmark compared to the other NAS benchmarks.

All in all, I agree with both of you. Both the L2-cache and the memory
bus are probably slowing you down.
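
For reference, the slowdown implied by the execution times quoted below works out to roughly 1.95x (one node vs. two) and 2.15x (one node vs. four). A tiny sketch of that arithmetic, using only the numbers from the original post (the helper name is made up):

```python
def slowdown(single_node_seconds, spread_seconds):
    """How many times slower the single-node run is than the spread-out run."""
    return single_node_seconds / spread_seconds


# Execution times reported in the original post, in seconds.
one_node, two_nodes, four_nodes = 430, 220, 200

print(f"1 node vs 2 nodes: {slowdown(one_node, two_nodes):.2f}x")
print(f"1 node vs 4 nodes: {slowdown(one_node, four_nodes):.2f}x")
```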

As for the sys% time, I believe it is the NIC driver. The more inter-
node communication, the more sys%. The shared memory communication
module (BTL SM) does all its communication in user space, as you

Best regards,

-Torje S. Henriksen

On Sep 30, 2008, at 6:55 PM, Jeff Squyres wrote:

> Are these intel-based machines? I have seen similar effects
> mentioned earlier in this thread where having all 8 cores banging on
> memory pretty much kills performance on the UMA-style intel 8 core
> machines. I'm not a hardware expert, but I've stayed away from
> buying 8-core servers for exactly this reason. AMD's been NUMA all
> along, and Intel's newer chips are NUMA to alleviate some of this
> bus pressure.
> ~2x performance loss (between 8 and 4 cores on a single node) seems
> a bit excessive, but I guess it could happen...? (I don't have any
> hard numbers either way)
> On Sep 29, 2008, at 2:30 PM, Leonardo Fialho wrote:
>> Hi All,
>> I'm running some tests on a multi-core (8 cores per node) machine
>> with the NAS benchmarks. Something that I consider strange is
>> occurring...
>> I'm using only one NIC and paffinity:
>> ./bin/mpirun
>> -n 8
>> --hostfile ./hostfile
>> --mca mpi_paffinity_alone 1
>> --mca btl_tcp_if_include eth1
>> --loadbalance
>> ./codes/nas/NPB3.3/NPB3.3-MPI/bin/lu.C.8
>> I have sufficient memory to run this application in only one node,
>> but:
>> 1) If I use one node (8 cores) the "user" % is around 100% per
>> core. The execution time is around 430 seconds.
>> 2) If I use 2 nodes (4 cores in each node) the "user" % is around
>> 95% per core and the "sys" % is 5%. The execution time is around
>> 220 seconds.
>> 3) If I use 4 nodes (1 cores in each node) the "user" % is around
>> 85% per core and the "sys" % is 15%. The execution time is around
>> 200 seconds.
>> Well... the questions are:
>> A) The execution time in case "1" should be smaller (only sm
>> communication, no?) than in cases "2" and "3", no? Cache problems?
>> B) Why is there "sys" time when communicating between nodes? The
>> NIC driver? Why does this time increase when I balance the load
>> across the nodes?
>> Thanks,
>> --
>> Leonardo Fialho
>> Computer Architecture and Operating Systems Department - CAOS
>> Universidad Autonoma de Barcelona - UAB
>> ETSE, Edifcio Q, QC/3088
>> Phone: +34-93-581-2888
>> Fax: +34-93-581-2478
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
> --
> Jeff Squyres
> Cisco Systems