> We don't really need a finer grain knowledge about the processor at
> compile time.
Some other open-source projects have already done something very
similar, if not identical; one of them is the media player MPlayer
(http://www.mplayerhq.hu/). Why not use these as starting points?
> The second question is how and when to figure out which of the
> available memcpy functions give the best performance.
This depends a lot on whether the job has the nodes all to itself or
shares them with other jobs. In the shared case, the other jobs
compete for the bandwidth between CPU and RAM, so the benchmark
timings can be significantly skewed.
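One way to make the measurement less sensitive to such interference is
to time each candidate several times and keep the best (minimum) time.
A minimal sketch in C, assuming POSIX clock_gettime is available; the
names copy_fn, time_copy and pick_fastest are made up for illustration
and are not existing Open MPI code:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by all candidate copy routines. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Wrapper so that the libc memcpy matches copy_fn exactly. */
static void *libc_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Trivial byte-wise copy, present only to have a second candidate. */
static void *bytewise_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Minimum time over `reps` runs; the minimum is the run that was
 * least disturbed by whatever else was using the node. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int reps)
{
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_sec();
        fn(dst, src, n);
        double dt = now_sec() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Index of the candidate with the smallest minimum time. */
static int pick_fastest(const copy_fn *cands, int ncands,
                        void *dst, const void *src, size_t n, int reps)
{
    assert(ncands > 0 && reps > 0);
    int best_idx = 0;
    double best_t = time_copy(cands[0], dst, src, n, reps);
    for (int i = 1; i < ncands; i++) {
        double t = time_copy(cands[i], dst, src, n, reps);
        if (t < best_t) {
            best_t = t;
            best_idx = i;
        }
    }
    return best_idx;
}
```

Taking the minimum only reduces the noise; on a heavily shared node
the result can still be off, and a real test would also have to cover
several message sizes and alignments.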
> On a homogeneous architecture, this might be a one node selection [I
> don't imagine using the modex to spread this information]
Hmm, it doesn't sound nice to have n-1 nodes waiting while 1 node does
the test. Maybe run it on all nodes and compare the results? And warn
the user if different memcpy versions would be chosen.
> The really annoying thing here, is that in the best case [in a
> perfect world] this should be done once per cluster.
... and, in view of the node sharing pointed out above, at a time when
the benchmark can have the nodes all to itself. This sounds very much
like the collectives tuning, with MCA params to express the admin's or
user's view of how the best performance can be achieved.
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850