On Nov 26, 2012, at 4:02 AM, George Markomanolis wrote:
> Another, more generic question is about discovering nodes with faulty memory. Is there any way to identify such nodes? I found by accident that one node couldn't execute an MPI application when it was using more than 12GB of RAM, while a second node with exactly the same hardware could use all of its 48GB of memory. With 500+ nodes it is difficult to check all of them, and I am not familiar with any efficient solution. Initially I thought about memtester, but it takes a lot of time. I know this is not exactly on topic for this mailing list, but I thought that maybe an Open MPI user knows something about it.
You really do want a dedicated memory tester. MPI applications *might* beat on your memory hard enough to expose errors, but that's just a side effect of HPC access patterns, not a reliable test.
If such a memory tester takes a long time, you might want to use mpirun to launch it on many nodes simultaneously, so that the nodes are tested in parallel rather than one after another.
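
As a rough illustration of the mpirun idea, here is a minimal Python sketch that starts one memtester per node and collects the hostnames that fail. Several things in it are assumptions of mine rather than anything from the original question: the hostfile name (hosts.txt), the amount of memory to test, the PASS/FAIL line format, and the helper names build_command and failed_hosts. It uses Open MPI's "--map-by ppr:1:node" option; older releases spell this "-npernode 1". The small sh wrapper always exits 0, since mpirun may otherwise terminate the whole job when the first node reports a failure.

```python
import subprocess


def build_command(hostfile, mem="4G", loops=1):
    """Build an mpirun command that starts one memtester per node.

    hostfile, mem and loops are illustrative parameters (assumptions).
    """
    # The wrapper prints "<hostname> PASS" or "<hostname> FAIL" and always
    # exits 0, so a single bad node does not cause mpirun to abort the
    # tests still running on the other nodes.
    wrapper = (
        f"if memtester {mem} {loops} >/dev/null 2>&1; "
        'then echo "$(hostname) PASS"; else echo "$(hostname) FAIL"; fi'
    )
    return [
        "mpirun",
        "--hostfile", hostfile,
        "--map-by", "ppr:1:node",  # one process per node
        "sh", "-c", wrapper,
    ]


def failed_hosts(output):
    """Return the sorted hostnames whose wrapper line ended in FAIL."""
    failed = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "FAIL":
            failed.append(parts[0])
    return sorted(failed)


if __name__ == "__main__":
    # Assumes memtester is installed on every node listed in hosts.txt and
    # that the user may lock the requested amount of memory (this usually
    # needs root or a raised memlock limit).
    result = subprocess.run(
        build_command("hosts.txt", mem="40G"),
        capture_output=True,
        text=True,
    )
    for host in failed_hosts(result.stdout):
        print("suspect memory:", host)
```

Pick a memory size somewhat below the installed RAM so the operating system keeps room to run; a node that fails, or that never reports back at all, is a candidate for a longer offline test.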