Apologies, I have not taken the time to read your comprehensive diagnostics!
As Gus says, this sounds like a memory problem.
My suspicion would be the kernel Out Of Memory (OOM) killer.
Log into those nodes (or ask your systems manager to do this) and look
closely at /var/log/messages, where the kernel logs a notification
whenever the OOM killer kicks in and - well - kills large-memory
processes!
Grep for "killed process" in /var/log/messages, ignoring case (grep -i),
since the kernel writes the message as "Killed process".
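
Something along these lines should do it. This is only a rough sketch: the function name find_oom_kills is made up for illustration, and I am assuming your distribution writes kernel messages to /var/log/messages (reading it usually needs root).

```shell
# Hypothetical helper: list OOM-killer victims recorded in a syslog file.
# The function name and the default path are illustrative assumptions.
find_oom_kills() {
    # -i: the kernel capitalises "Killed", so match case-insensitively.
    # "${1:-...}": use the file given as first argument, else the default.
    grep -i "killed process" "${1:-/var/log/messages}"
}
```

Run find_oom_kills on each suspect node. No output means nothing matched in that file. If the node has no /var/log/messages at all, the same grep applied to the output of dmesg should show recent kills.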