> A) The execution time in case "1" should be smaller (only sm
> communication, no?) than case "2" and "3", no? Cache problems?
Shot in the dark, from working on a Sun T1 (also 8 real cores): from time
to time the OS wants to do something (interrupt handling, waking up
cron, ...). Leaving one or two cores spare for that purpose sometimes
yields much better performance, because the OS no longer has to preempt
your compute processes to get its own work done.
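One way to try this on Linux, sketched in Python purely for illustration: reserve some cores for the OS and pin the compute process to the rest. The helper name `choose_worker_cores` is hypothetical, and leaving the lowest-numbered cores free is an arbitrary assumption. In an MPI job it is often simpler to start fewer ranks than cores, or to use your MPI launcher's own binding options.

```python
import os

def choose_worker_cores(available, spare=1):
    """Return the cores left for compute after reserving `spare` for the OS.

    Assumption: the lowest-numbered cores are the ones left to the OS.
    """
    cores = sorted(available)
    if spare < 0 or spare >= len(cores):
        raise ValueError("spare must leave at least one core for work")
    return set(cores[spare:])

if __name__ == "__main__" and hasattr(os, "sched_setaffinity"):
    # Linux-only: pin this process to all but one of its allowed cores.
    allowed = os.sched_getaffinity(0)
    if len(allowed) > 1:
        workers = choose_worker_cores(allowed, spare=1)
        os.sched_setaffinity(0, workers)
        print("running on cores", sorted(workers))
```

On an 8-core box, `choose_worker_cores(range(8), spare=1)` leaves cores 1 to 7 for the computation.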
> B) Why the "sys" time while using communication inter nodes? NIC
That does not seem to be an uncommon value for an Ethernet NIC driver and
TCP/IP stack. It depends on the specific hardware (e.g. on-board
Ethernet cards are worse than "real" ones; InfiniBand etc. is better
than Ethernet, ...) and on the number of messages, which in turn depends
on the algorithm. Depending on how you took the measurements (and on
which OS/kernel/...), that time may also include time the driver spends
waiting for something to happen on the network.
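To see how user and sys time split for a given piece of work, you can read the process's resource usage before and after it. A minimal sketch, assuming a Unix system (Python's `resource` module). `measure` and `many_small_writes` are hypothetical helpers. The writes to `/dev/null` only stand in for "many small messages" and are not a model of a real NIC driver.

```python
import os
import resource
import time

def measure(fn, *args):
    """Run fn(*args) and return (wall, user, sys) seconds for this process."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.perf_counter()
    fn(*args)
    wall = time.perf_counter() - t0
    after = resource.getrusage(resource.RUSAGE_SELF)
    user = after.ru_utime - before.ru_utime
    sys_time = after.ru_stime - before.ru_stime
    return wall, user, sys_time

def many_small_writes(n):
    # Each os.write is a separate system call, so most of the cost of this
    # loop is charged to "sys" time rather than "user" time.
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        for _ in range(n):
            os.write(fd, b"x")
    finally:
        os.close(fd)

if __name__ == "__main__":
    wall, user, sys_time = measure(many_small_writes, 200_000)
    print(f"wall={wall:.3f}s user={user:.3f}s sys={sys_time:.3f}s")
```

If wall time is clearly larger than user + sys, the process spent the difference blocked (e.g. sleeping while waiting for the network) rather than running on a CPU.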
> Why this time increase when I balance the load across the
The more nodes you use, the more communication takes place between them:
more "parties" have to sync with each other, so more overhead is
generated, and at some point that overhead outweighs what you gain by
splitting the work further.
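A toy cost model makes this trade-off concrete. It is an assumption for illustration, not a measurement from this thread. Compute time is split perfectly across `p` nodes, while each synchronization is assumed to be tree-shaped and therefore needs `ceil(log2 p)` rounds of fixed latency. `model_time` is a hypothetical name. Real collectives and interconnects behave differently.

```python
import math

def model_time(p, work, latency_per_round):
    """Toy model: perfectly divided compute plus a tree-shaped sync.

    work / p                          -> compute shrinks as nodes are added
    latency_per_round * ceil(log2 p)  -> sync rounds grow as nodes are added
    """
    if p < 1:
        raise ValueError("need at least one node")
    return work / p + latency_per_round * math.ceil(math.log2(p))

if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16):
        print(p, model_time(p, work=100, latency_per_round=10))
```

With these made-up numbers the predicted times are 100.0, 60.0, 45.0, 42.5 and 46.25. Going from 8 to 16 nodes makes the run slower, because the extra synchronization round costs more than the halved compute saves.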