On 09/03/2012 04:39 PM, Andrea Negri wrote:
>> max locked memory (kbytes, -l) 32
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 1024
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> stack size (kbytes, -s) 10240
Besides the possibilities of running out of physical memory,
or even defective memory chips, which Jeff, Ralph,
John, and George have addressed, I still think that the
system limits above may play a role.
In an 8-year-old cluster, hardware failures are not unexpected.
1) System limits
For what it is worth, virtually none of the programs we run here,
mostly atmosphere/ocean/climate codes,
would run with these limits.
On our compute nodes we set
max locked memory and stack size to
unlimited, to avoid problems with symptoms very
similar to those you describe.
Typically there are lots of automatic arrays in subroutines,
etc, which require a large stack.
Your sys admin could add these lines to the bottom
of /etc/security/limits.conf [the last one sets the
max number of open files]:
* - memlock -1
* - stack -1
* - nofile 4096
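After editing limits.conf and logging in again, it is worth checking what a shell on the compute node actually sees; a minimal check (the expected values assume the limits.conf lines above were applied):

```shell
#!/bin/sh
# Verify the limits a fresh login/batch shell actually sees
# on the compute node, after the limits.conf change.
echo "memlock: $(ulimit -l)"   # should report "unlimited"
echo "stack:   $(ulimit -s)"   # should report "unlimited"
echo "nofile:  $(ulimit -n)"   # should report 4096
```

Note that a batch system daemon started before the change may still hand jobs the old limits, so check from inside a job as well.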
2) Defective network interface/cable/switch port
Yet another possibility, following Ralph's suggestion,
is that you may have a failing network interface, or a bad
Ethernet cable or connector, on the node that goes south,
or on the switch port that serves that node.
[I am assuming your network is Ethernet, probably GigE.]
Again, in an 8-year-old cluster, hardware failures are not unexpected.
We had this sort of problem with old clusters and old nodes.
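One quick way to look for a bad NIC/cable/port on a Linux node is to read the kernel's per-interface error counters; a small sketch (the interface name "eth0" is an assumption, substitute yours):

```shell
#!/bin/sh
# Print error/drop counters for one network interface.
# "eth0" is an assumed default; pass your interface as $1.
IF=${1:-eth0}
for f in rx_errors tx_errors rx_dropped tx_dropped; do
    echo "$IF $f: $(cat /sys/class/net/$IF/statistics/$f)"
done
```

Nonzero and steadily growing error counters on the suspect node, but not on its peers, would point at the cable, connector, or switch port.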
3) Quarantine the bad node
Is it always the same node that fails, or does it vary?
[Please answer, it helps us understand what's going on.]
If it is always the same node, have you tried to quarantine it,
either by temporarily removing it from your job submission system
or just turning it off, and running the job on the remaining nodes?
I hope this helps,