Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
From: Gus Correa (gus_at_[hidden])
Date: 2012-09-07 18:12:20


On 09/07/2012 08:02 AM, Jeff Squyres wrote:
> On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
>
>> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on.
>
>
> Another thought on hardware errors... I actually have seen bad RAM cause spontaneous reboots with no Linux warnings.
>
> Do you have any hardware diagnostics from your server
> vendor that you can run?
>

If you don't have a vendor-provided diagnostic tool,
you or your sys admin could try Advanced Clustering's "breakin":

http://www.advancedclustering.com/our-software/view-category.html

Download the ISO version, burn it to a CD, put it in the node's CD drive
(assuming it has one), reboot, and choose breakin from the menu options.
If there is no CD drive, there is a network-boot alternative,
although it is more involved.

I hope it helps,
Gus Correa

> A simple way to test your RAM (it's not completely comprehensive, but it does check for a surprisingly wide array of memory issues) is to do something like this (pseudocode):
>
> -----
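> #include <stdio.h>
> #include <stdlib.h>
>
> int main(void) {
>     size_t i, n;
>     size_t increment = (size_t)1 << 30;  /* 1 GB */
>     size_t size = increment;
>     volatile int *ptr;  /* volatile: keep the compiler from eliding the readback */
>     int errors = 0;
>
>     // Find the biggest amount of memory that you can malloc
>     while (increment >= 1024) {
>         void *probe = malloc(size);
>         if (NULL != probe) {
>             free(probe);
>             size += increment;
>         } else {
>             size -= increment;
>             increment /= 2;
>         }
>     }
>     printf("I can malloc %zu bytes\n", size);
>
>     // Malloc that huge chunk of memory
>     ptr = malloc(size);
>     if (NULL == ptr) {
>         printf("Final malloc failed\n");
>         return 1;
>     }
>     // Index the buffer instead of advancing ptr, so it can still be freed
>     n = size / sizeof(int);
>     for (i = 0; i < n; ++i) {
>         ptr[i] = 37;
>         if (ptr[i] != 37) {
>             printf("Readback error at index %zu!\n", i);
>             ++errors;
>         }
>     }
>     free((void *)ptr);
>
>     printf("All done\n");
>     return errors ? 1 : 0;
> }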
> -----
>
> Depending on how much memory you have,
> that might take a little while to run
> (all the memory has to be paged in, etc.).
> You might want to add a status output to show progress,
> and/or write/read a page at a time for better efficiency, etc.
> But you get the idea.
>
> Hope that helps.
>