Jed Brown wrote:
> On Mon 2008-09-29 20:30, Leonardo Fialho wrote:
>> 1) If I use one node (8 cores) the "user" % is around 100% per core. The
>> execution time is around 430 seconds.
>> 2) If I use 2 nodes (4 cores in each node) the "user" % is around 95%
>> per core and the "sys" % is 5%. The execution time is around 220 seconds.
>> 3) If I use 4 nodes (*2* cores in each node) the "user" % is around 85%
>> per core and the "sys" % is 15%. The execution time is around 200 seconds.
> Do you mean 2 cores per node (1 core per socket)?
>> Well... the questions are:
>> A) Shouldn't the execution time in case "1" be smaller than in cases "2"
>> and "3", since it uses only sm (shared memory) communication? Cache problems?
> Is this benchmark memory bandwidth limited? Your results are fairly
> typical for sparse matrix kernels. One core can more or less saturate
> the bus on its own, two cores can overlap memory access so it doesn't
> hurt too much, more than two and they are all waiting on memory. The
> extra cores are cheaper than more sockets but they don't do much/any
> good for many workloads.
>> B) Why is there "sys" time when communicating between nodes? NIC driver?
>> Why does this time increase when I balance the load across the nodes?
> Messages over Ethernet cost more than messages in shared memory. When
> you only use 1 core per socket, the application is faster because the
> single thread has the full memory bandwidth to itself; however, MPI needs
> to move more data over the wire so that phase costs more. If your
> network was faster (e.g. InfiniBand) you could expect the communication
> to stay quite cheap even with only one process per node.
The nodes have 2 sockets with 4 cores each.
In other words, judging by cases "2" and "3", is contention for the
memory bus among more than 2 tasks worse than going over Gigabit Ethernet?
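
To put rough numbers on that question, here is a small Python sketch that uses
only the timings and CPU percentages from my three runs. It assumes (and this
is only an assumption) that the "user" and "sys" percentages hold uniformly
over the whole run, so the wall time can be split in proportion. The names
`runs`, `split_time` and `summary` are just illustrative.

```python
# Timings and CPU percentages as reported for runs 1-3.
# Assumption: the user/sys split holds uniformly over the whole run.
runs = {
    "1 node x 8 cores": {"seconds": 430, "user": 1.00, "sys": 0.00},
    "2 nodes x 4 cores": {"seconds": 220, "user": 0.95, "sys": 0.05},
    "4 nodes x 2 cores": {"seconds": 200, "user": 0.85, "sys": 0.15},
}


def split_time(run):
    """Split wall time into (user, sys) seconds in proportion to the CPU percentages."""
    return run["seconds"] * run["user"], run["seconds"] * run["sys"]


baseline = runs["1 node x 8 cores"]["seconds"]
summary = {}
for name, run in runs.items():
    user_s, sys_s = split_time(run)
    speedup = baseline / run["seconds"]  # relative to the single-node run
    summary[name] = (user_s, sys_s, speedup)
    print(f"{name}: user {user_s:.0f} s, sys {sys_s:.0f} s, speedup {speedup:.2f}x")
```

All three runs use 8 processes, so any speedup here comes from placement
alone. Under the assumption above, the estimated "user" time drops from 430 s
to roughly 209 s and 170 s, while the estimated "sys" time only grows to
roughly 11 s and 30 s.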
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifici Q, QC/3088