On Mon 2008-09-29 20:30, Leonardo Fialho wrote:
> 1) If I use one node (8 cores) the "user" % is around 100% per core. The
> execution time is around 430 seconds.
> 2) If I use 2 nodes (4 cores in each node) the "user" % is around 95%
> per core and the "sys" % is 5%. The execution time is around 220 seconds.
> 3) If I use 4 nodes (1 core in each node) the "user" % is around 85%
> per core and the "sys" % is 15%. The execution time is around 200 seconds.
Do you mean 2 cores per node (1 core per socket)?
> Well... the questions are:
> A) The execution time in case "1" should be smaller (only sm
> communication, no?) than case "2" and "3", no? Cache problems?
Is this benchmark memory-bandwidth limited? Your results are fairly
typical for sparse matrix kernels: one core can more or less saturate
the memory bus on its own, two cores can overlap their memory accesses
so the contention doesn't hurt too much, and beyond two they are all
waiting on memory. The extra cores are cheaper than extra sockets, but
they do little or no good for many workloads.
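That saturation argument can be sketched as a toy roofline-style model.
All the numbers below (bus bandwidth, per-core demand) are illustrative
assumptions, not measurements from this thread:

```python
def bandwidth_limited_speedup(n_cores, bus_gbs=6.0, demand_per_core_gbs=5.0):
    """Ideal speedup capped by a shared memory bus (toy model).

    Each core wants demand_per_core_gbs of bandwidth; the node's bus
    delivers at most bus_gbs in total, so speedup saturates once
    n_cores * demand_per_core_gbs exceeds the bus.
    """
    return min(n_cores, bus_gbs / demand_per_core_gbs)

# With these assumed numbers, one core nearly saturates the bus,
# two cores gain a little from overlap, and 4 or 8 gain nothing more.
for n in (1, 2, 4, 8):
    print(n, round(bandwidth_limited_speedup(n), 2))
```

This matches the pattern you see: going from 8 cores on one node to 4+4
or 1 per node mostly buys each process a bigger slice of bus bandwidth.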
> B) Why the "sys" time while using communication inter nodes? NIC driver?
> Why this time increase when I balance the load across the nodes?
Messages over Ethernet cost more than messages through shared memory.
When you use only one core per socket, the application is faster because
each process has the full memory bandwidth to itself; however, MPI then
has to move more data over the wire, so the communication phase costs
more. With a faster network (e.g. InfiniBand) you could expect the
communication to stay quite cheap even with only one process per node.
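The Ethernet-vs-shared-memory gap can be put in rough numbers with the
classic alpha-beta (latency + bandwidth) message cost model. The latency
and bandwidth figures here are assumed ballpark values for gigabit
Ethernet and an in-node shared-memory copy, not measurements:

```python
def msg_time(nbytes, latency_s, bandwidth_bps):
    """Alpha-beta model: T = alpha + n / beta."""
    return latency_s + nbytes / bandwidth_bps

# Assumed ballpark figures, not measured values:
eth = dict(latency_s=50e-6, bandwidth_bps=125e6)   # gigabit Ethernet
shm = dict(latency_s=1e-6,  bandwidth_bps=2e9)     # shared-memory copy

n = 1_000_000  # a 1 MB message
print(msg_time(n, **eth) / msg_time(n, **shm))     # roughly an order of magnitude
```

Under these assumptions the same message is over ten times more
expensive across the wire, and the extra kernel work in the NIC driver
is what shows up as "sys" time in your measurements.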