On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on.
Another thought on hardware errors... I actually have seen bad RAM cause spontaneous reboots with no Linux warnings.
Do you have any hardware diagnostics from your server vendor that you can run?
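A simple way to test your RAM (it's not comprehensive, but it does catch a surprisingly wide array of memory issues) is to do something like this (a rough C sketch):
-----
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i, count, errors = 0;
    size_t increment = (size_t)1 << 30;  // 1 GB
    size_t size = (size_t)1 << 30;       // 1 GB
    void *mem;
    volatile int *ptr;  // volatile so the readback is not optimized away

    // Find the biggest amount of memory that you can malloc
    while (increment >= 1024) {
        mem = malloc(size);
        if (NULL != mem) {
            free(mem);
            size += increment;
        } else {
            size -= increment;
            increment /= 2;
        }
    }
    printf("I can malloc %zu bytes\n", size);

    // Malloc that huge chunk of memory
    mem = malloc(size);
    if (NULL == mem) {
        printf("Final malloc failed\n");
        return 1;
    }
    // Write every int, then read it straight back
    ptr = mem;
    count = size / sizeof(int);
    for (i = 0; i < count; ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37) {
            printf("Readback error at index %zu!\n", i);
            ++errors;
        }
    }
    free(mem);
    printf("All done (%zu errors)\n", errors);
    return errors ? 1 : 0;
}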
-----
Depending on how much memory you have, that might take a little while to run (all the memory has to be paged in, etc.). You might want to add a status output to show progress, and/or write/read a page at a time for better efficiency, etc. But you get the idea.
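For what it's worth, here is a rough sketch of the page-at-a-time idea with optional progress output. Treat it as an illustration, not a real memory tester: the name check_region, the CHUNK_BYTES value of 4096 (an assumed page size), and the report_every parameter are all made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_BYTES 4096  /* assumed page size; adjust for your system */

/* Hypothetical helper: fill `bytes` bytes at buf with `pattern`, one chunk
 * at a time, reading each chunk back right after writing it.
 * Returns the number of mismatched bytes.
 * If report_every is nonzero (ideally a multiple of CHUNK_BYTES), a progress
 * line goes to stderr each time that many more bytes have been checked. */
size_t check_region(volatile unsigned char *buf, size_t bytes,
                    unsigned char pattern, size_t report_every)
{
    size_t errors = 0, off, i, n;

    assert(buf != NULL);
    for (off = 0; off < bytes; off += n) {
        /* The last chunk may be shorter than CHUNK_BYTES */
        n = (bytes - off < CHUNK_BYTES) ? bytes - off : CHUNK_BYTES;

        for (i = 0; i < n; ++i)          /* write one chunk */
            buf[off + i] = pattern;
        for (i = 0; i < n; ++i)          /* read the same chunk back */
            if (buf[off + i] != pattern)
                ++errors;

        if (report_every != 0 && (off + n) % report_every == 0)
            fprintf(stderr, "checked %zu of %zu bytes\n", off + n, bytes);
    }
    return errors;
}
```

You would call it on the big malloc'ed buffer, e.g. check_region(mem, size, 0x55, (size_t)1 << 30) to get a progress line per GB. Running it a second time with a complementary pattern such as 0xAA is a common trick that exercises every bit in both states, though I haven't measured how much that helps in practice.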
Hope that helps.
--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/