On Apr 14, 2011, at 9:22 AM, Rushton Martin wrote:
> For your information: we were supplied with a script when we bought the
> cluster, but the original script made the assumption that all processes
> and shm files belonging to a specific user ought to be deleted. This is
> a problem if users submit jobs which only half fill a node and the
> second job starts on the same node as the first one. The first job to
> finish causes the continuing job to stop dead. We therefore had to
> disable any cleanup to allow jobs to run. Now we are finding a slow
> fill up with the shm files and I need to do something; at least now I
> have a way forward.
Note that Open MPI v1.4.x is likely using mmap-backed files by default -- these should be under /tmp somewhere. If they get left around, they can fill up that filesystem, but they should be unrelated to anything in /dev/shm. If you're seeing /dev/shm fill up, that is probably due to something else.
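For what it's worth, a safer cleanup policy than "delete everything owned by user X" is to remove a session directory only when its owner no longer has any process on that node -- that avoids killing the second half-node job you described. A rough sketch (the `list_stale_sessions` helper and the `openmpi-sessions-*` name pattern are illustrative assumptions, not Open MPI-supplied code; it lists candidates rather than deleting them):

```shell
#!/bin/sh
# Hypothetical cleanup sketch: report Open MPI session directories whose
# owning user has no processes left on this node. Deletion is commented
# out on purpose; verify the selection before enabling it.
list_stale_sessions() {
    # $1: directory to scan (normally /tmp)
    for d in "$1"/openmpi-sessions-*; do
        [ -d "$d" ] || continue
        owner=$(stat -c %U "$d")      # GNU stat; BSD would need stat -f %Su
        # Skip the directory if its owner still has any running process --
        # this is what keeps a continuing half-node job alive.
        pgrep -u "$owner" >/dev/null 2>&1 && continue
        echo "stale: $d"
        # rm -rf "$d"   # enable only after checking the output by hand
    done
}
```

Running that from a cron job or a scheduler epilogue as root would drain the slow fill-up without touching live jobs.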
Also, I'm a little confused by your reference to psm_shm... are you talking about the QLogic PSM device? If that does some tomfoolery with /dev/shm somewhere, I'm unaware of it (i.e., I don't know much/anything about what that device does internally).