
Subject: Re: [OMPI users] Execution in multicore machines
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-09-29 15:01:52


Jed Brown wrote:
> On Mon 2008-09-29 20:30, Leonardo Fialho wrote:
>
>> 1) If I use one node (8 cores) the "user" % is around 100% per core. The
>> execution time is around 430 seconds.
>>
>> 2) If I use 2 nodes (4 cores in each node) the "user" % is around 95%
>> per core and the "sys" % is 5%. The execution time is around 220 seconds.
>>
>> 3) If I use 4 nodes (*2* cores in each node) the "user" % is around 85%
>> per core and the "sys" % is 15%. The execution time is around 200
>> seconds.
>>
> Do you mean 2 cores per node (1 core per socket)?
>
Exactly, sorry.
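(For reference, the three layouts can be requested with mpirun along
these lines; this is just a sketch assuming a hostfile "hosts.txt"
listing the four nodes, and the --npernode option, which may not be
available in older Open MPI releases:

   # case 1: 8 processes on one node
   mpirun -np 8 --hostfile hosts.txt --npernode 8 ./app
   # case 2: 8 processes on 2 nodes, 4 per node
   mpirun -np 8 --hostfile hosts.txt --npernode 4 ./app
   # case 3: 8 processes on 4 nodes, 2 per node
   mpirun -np 8 --hostfile hosts.txt --npernode 2 ./app

Note that in case 3 the placement of the two processes on sockets is
still left to the OS unless processor affinity is enabled.)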
>> Well... the questions are:
>>
>> A) Shouldn't the execution time in case "1" (only sm communication)
>> be smaller than in cases "2" and "3"? Cache problems?
>>
> Is this benchmark memory bandwidth limited? Your results are fairly
> typical for sparse matrix kernels. One core can more or less saturate
> the bus on its own, two cores can overlap memory access so it doesn't
> hurt too much, more than two and they are all waiting on memory. The
> extra cores are cheaper than more sockets but they don't do much/any
> good for many workloads.
>
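That would explain it. To check whether the kernel is really
bandwidth-bound, I suppose one could run a streaming loop like the
sketch below with 1, 2, 4 and 8 processes on a single node and watch
the per-process rate fall once the bus saturates (array size and
repetition count are arbitrary, just large enough to defeat the cache):

   #include <mpi.h>
   #include <stdio.h>
   #include <stdlib.h>

   #define N (1 << 24)            /* ~16M doubles, well beyond cache */

   int main(int argc, char **argv)
   {
       int rank;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       double *a = malloc(N * sizeof(double));
       double *b = malloc(N * sizeof(double));
       for (int i = 0; i < N; i++)
           b[i] = 1.0;

       MPI_Barrier(MPI_COMM_WORLD);
       double t0 = MPI_Wtime();
       for (int rep = 0; rep < 10; rep++)   /* streaming scale: a = 2b */
           for (int i = 0; i < N; i++)
               a[i] = 2.0 * b[i];
       double t = MPI_Wtime() - t0;

       /* bytes moved: 2 arrays * 8 bytes * N elements * 10 repetitions;
          printing a[0] also keeps the loop from being optimized away */
       printf("rank %d: %.2f GB/s (a[0]=%g)\n",
              rank, 2.0 * 8.0 * N * 10 / 1e9 / t, a[0]);

       free(a);
       free(b);
       MPI_Finalize();
       return 0;
   }
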
>> B) Why is there "sys" time when communicating between nodes? The NIC
>> driver? And why does this time increase when I balance the load
>> across the nodes?
>>
> Messages over Ethernet cost more than messages in shared memory. When
> you only use 1 core per socket, the application is faster because the
> single thread has the full memory bandwidth to itself, however MPI needs
> to move more data over the wire so that phase costs more. If your
> network was faster (e.g. InfiniBand) you could expect the communication
> to stay quite cheap even with only one process per node.
>
The nodes have 2 sockets with 4 cores in each.

In other words... in cases "2" and "3", is the contention for the
bus/memory among more than 2 tasks worse than the Gigabit Ethernet
overhead?
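
If it is the message cost, a plain ping-pong between two ranks, run
once inside a node (sm) and once across nodes (tcp), should show the
gap directly. A rough sketch, with message size and repetition count
chosen arbitrarily:

   #include <mpi.h>
   #include <stdio.h>

   #define MSG  (1 << 20)                     /* 1 MB message */
   #define REPS 100

   int main(int argc, char **argv)
   {
       static char buf[MSG];
       int rank;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       MPI_Barrier(MPI_COMM_WORLD);
       double t0 = MPI_Wtime();
       for (int i = 0; i < REPS; i++) {
           if (rank == 0) {                   /* rank 0 sends, then waits */
               MPI_Send(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
               MPI_Recv(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                        MPI_STATUS_IGNORE);
           } else if (rank == 1) {            /* rank 1 echoes it back */
               MPI_Recv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                        MPI_STATUS_IGNORE);
               MPI_Send(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
           }
       }
       if (rank == 0)
           printf("avg round trip: %.1f us\n",
                  (MPI_Wtime() - t0) / REPS * 1e6);
       MPI_Finalize();
       return 0;
   }

Running it with "mpirun -np 2" first on one node and then across two
nodes would separate the transport cost from the memory contention.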

-- 
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478