Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-07 08:02:10


On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:

> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on.

Another thought on hardware errors... I actually have seen bad RAM cause spontaneous reboots with no Linux warnings.

Do you have any hardware diagnostics from your server vendor that you can run?

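A simple way to test your RAM (it's not completely comprehensive, but it does check for a surprisingly wide array of memory issues) is to do something like this (a C sketch):

-----
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t GB = (size_t)1 << 30;
    size_t i, count, errors = 0;
    size_t size = GB, increment = GB;
    volatile int *ptr;  /* volatile so the compiler keeps every write and readback */
    void *probe;

    /* Find (roughly) the biggest amount of memory that you can malloc */
    while (increment >= 1024) {
        probe = malloc(size);
        if (NULL != probe) {
            free(probe);
            size += increment;
        } else {
            /* Back off to the last size that worked and take smaller steps */
            size -= increment;
            increment /= 2;
        }
    }
    printf("I can malloc %zu bytes\n", size);

    /* Malloc that huge chunk of memory */
    ptr = malloc(size);
    if (NULL == ptr) {
        printf("Final malloc failed\n");
        return 1;
    }
    count = size / sizeof(int);

    /* Write a known value to every int and read it straight back */
    for (i = 0; i < count; ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37) {
            printf("Readback error at index %zu!\n", i);
            ++errors;
        }
    }

    free((void *)ptr);
    printf("All done, %zu readback errors\n", errors);
    return errors ? 1 : 0;
}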
-----

Depending on how much memory you have, that might take a little while to run (all the memory has to be paged in, etc.). You might want to add a status output to show progress, and/or write/read a page at a time for better efficiency, etc. But you get the idea.
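The progress output and page-at-a-time variant mentioned above could look roughly like the sketch below. The helper name check_region, the 0x5A-style pattern argument, the 4096-byte fallback page size, and the 256 MB reporting interval are all assumptions made for illustration; nothing here comes from a vendor diagnostic.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: fill `bytes` bytes at `buf` with `pattern` one page
 * at a time, read each page back, and return the number of bytes that did
 * not match. `page` is an assumed page size; 0 falls back to 4096. */
size_t check_region(volatile unsigned char *buf, size_t bytes,
                    size_t page, unsigned char pattern)
{
    size_t off, i, errors = 0;
    size_t next_report = 0;

    if (page == 0)
        page = 4096;

    for (off = 0; off < bytes; off += page) {
        /* The last chunk may be shorter than a full page */
        size_t len = (bytes - off < page) ? bytes - off : page;

        /* Write the whole page first... */
        for (i = 0; i < len; ++i)
            buf[off + i] = pattern;

        /* ...then read the whole page back */
        for (i = 0; i < len; ++i)
            if (buf[off + i] != pattern)
                ++errors;

        /* Status line roughly every 256 MB, on stderr */
        if (off >= next_report) {
            fprintf(stderr, "checked %zu of %zu MB\n", off >> 20, bytes >> 20);
            next_report += (size_t)256 << 20;
        }
    }
    return errors;
}
```

You would call it on the big malloc'ed chunk, e.g. check_region(buf, size, 4096, 0x5A), and treat any nonzero return as a readback failure. Running it a second time with a different pattern (say 0xA5) flips every bit, which is a cheap way to widen the coverage a little.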

Hope that helps.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/