
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
From: Gus Correa (gus_at_[hidden])
Date: 2012-09-07 18:01:46


On 09/03/2012 04:39 PM, Andrea Negri wrote:
> max locked memory (kbytes, -l) 32
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) 10240
>

Hi Andrea
Besides the possibilities of running out of physical
memory, or even defective memory chips, which Jeff, Ralph,
John, and George have addressed, I still think that the
system limits above may play a role.
In an 8-year-old cluster, hardware failures are not unexpected.

1) System limits

For what it is worth, virtually none of the programs we run here,
mostly atmosphere/ocean/climate codes,
would run with these limits.
On our compute nodes we set
max locked memory and stack size to
unlimited, to avoid problems with symptoms very
similar to those you describe.
Typically there are lots of automatic arrays in subroutines,
etc, which require a large stack.
Your sys admin could add these lines to the bottom
of /etc/security/limits.conf [the last one sets the
max number of open files]:

* - memlock -1
* - stack -1
* - nofile 4096
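Once the nodes have been rebooted (or PAM has re-read limits.conf), it is worth confirming that the new limits actually reach the shells your MPI processes inherit. A minimal check, run on a compute node (or submitted through your batch system, so it sees the job environment), could look like this:

```shell
# Print the limits a fresh shell inherits; compare against limits.conf.
ulimit -l   # max locked memory -- should now say "unlimited"
ulimit -s   # stack size        -- should now say "unlimited"
ulimit -n   # open files        -- should now say 4096
```

If the interactive login shell shows the new values but the batch job does not, the scheduler daemon may need a restart so its children pick up the limits too.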

2) Defective network interface/cable/switch port

Yet another possibility, following Ralph's suggestion,
is that you may have a failing network interface, or a bad
Ethernet cable or connector, on the node that goes south,
or on the switch port that serves that node.
[I am assuming your network is Ethernet, probably GigE.]

Again, in an 8-year-old cluster, hardware failures are not unexpected.

We had this sort of problem with old clusters and old nodes.
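A cheap way to look for a failing NIC, cable, or switch port from the node itself is to watch the kernel's per-interface error counters. A Linux-specific sketch (interface names will vary from node to node):

```shell
# Dump rx/tx error counters for every network interface on this node.
# A counter that keeps growing under load points at a bad NIC,
# cable/connector, or switch port.
for dev in /sys/class/net/*; do
  printf '%s rx_errors=%s tx_errors=%s\n' \
    "$(basename "$dev")" \
    "$(cat "$dev/statistics/rx_errors")" \
    "$(cat "$dev/statistics/tx_errors")"
done
```

Run it before and after a job that exercises the network and compare the counts; a healthy GigE link should show essentially zero errors.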

3) Quarantine the bad node

Is it always the same node that fails, or does it vary?
[Please answer, it helps us understand what's going on.]

If it is always the same node, have you tried to quarantine it,
either temporarily removing it from your job submission system
or just turning it off, and run the job on the remaining
nodes?
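If you want to try the quarantine without touching the job submission system, one way is to hand mpirun a hostfile that simply leaves the suspect node out. The hostnames and slot counts below are just placeholders for your own:

```shell
# Build a hostfile that omits the suspect node (names are examples).
cat > myhosts <<'EOF'
node01 slots=4
node02 slots=4
# node03 slots=4   <- quarantined on purpose
node04 slots=4
EOF
# Then run on the remaining nodes, e.g.:
# mpirun --hostfile myhosts -np 12 ./your_app
```

If the job then runs to completion reliably, that is strong evidence the problem lives in that node's hardware rather than in your code.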

I hope this helps,
Gus Correa