On Aug 16, 2008, at 1:40 PM, Brian Dobbins wrote:
> Hi guys,
> I was hoping someone here could shed some light on OpenMPI's use
> of /tmp (or, I guess, TMPDIR) and save me from diving into the
> source.. ;)
> The background is that I'm trying to run some applications on a
> system which has a flaky parallel file system which TMPDIR is mapped
> to - so, on start up, OpenMPI creates its 'openmpi-sessions-<user>'
> directory there, and under that, a few files. Sometimes I see 1
> subdirectory (usually a 0), sometimes a 0 and a 1, etc. In one of
> these, sometimes I see files such as 'shared_memory_pool.<host>',
> and 'shared_memory_module.<host>'.
> My questions are, one, what are the various numbers / files for?
> (If there's a write-up on this somewhere, just point me towards it!)
The numbers just correspond to the jobid and vpid of the processes on
the node. We use them to ensure that each process has its own
"trusted" location where it can store tmp files without the processes
stepping on each other. These directories generally do not get used
except for storing the shared memory files and for debugging output in
the case of an internal OMPI error.
The shared_memory files are backing files for shared memory operations.
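
If you want to see what is actually sitting in the session directory on a
node, something like the following may help. It is only a rough sketch: it
assumes the top-level directory name starts with 'openmpi-sessions-<user>'
as you describe, and it makes no assumption about how deep the jobid/vpid
subdirectories go, it just walks everything. The function name
find_session_files is made up for this example.

```python
import getpass
import os
import tempfile


def find_session_files(tmp_root=None, user=None):
    """List (relative_path, size_in_bytes) for every file found under any
    'openmpi-sessions-<user>*' directory inside tmp_root.

    tmp_root defaults to tempfile.gettempdir(), which honours TMPDIR.
    The directory layout below the top level is NOT assumed; we simply walk it.
    """
    tmp_root = tmp_root or tempfile.gettempdir()
    user = user or getpass.getuser()
    prefix = "openmpi-sessions-" + user
    found = []
    for entry in sorted(os.listdir(tmp_root)):
        top = os.path.join(tmp_root, entry)
        if not (entry.startswith(prefix) and os.path.isdir(top)):
            continue
        for dirpath, dirnames, filenames in os.walk(top):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, tmp_root)
                found.append((rel, os.path.getsize(full)))
    return found


# Example use on a compute node while a job is running:
#   for path, size in find_session_files():
#       print(size, path)
```

Running it a few times during a job should show whether the shared memory
backing files are present and how large they are on that file system.
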
> And two, the real question, are these 'files' used during runtime,
> or only upon startup / shutdown? I'm having issues with various
> codes, especially those heavy on messages and I/O, failing to
> complete a run, and haven't resorted to sifting through strace's
> output yet. This doesn't happen all the time, but I've seen it
> happen reliably now with one particular code - its success rate (it
> DOES succeed sometimes) is about 25% right now. My best guess is
> that this is because the file system is overloaded, thus not
> allowing timely I/O or access to OpenMPI's files, but I wanted to
> get a quick understanding of how these files are used by OpenMPI and
> whether the FS does indeed seem a likely culprit before going with
> that theory for sure.
I would guess that you are having a problem with shared memory
operations. Try using "-mca btl ^sm" on your cmd line to turn off
shared memory and see if your success rate goes up - if so, then you
have identified the problem!
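
Since your failures are intermittent, one run each way won't tell you much.
Here is a minimal sketch of how you might compare success rates over
repeated runs. The helper name success_rate, the "-np 4" and the "./my_app"
executable are placeholders I made up, so substitute your real command line;
the only part taken from the suggestion above is "-mca btl ^sm".

```python
import subprocess


def success_rate(cmd, trials=20, timeout=None):
    """Run cmd `trials` times and return the fraction that exit with status 0.

    A run that exceeds `timeout` seconds is counted as a failure.
    """
    ok = 0
    for _ in range(trials):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode == 0:
                ok += 1
        except subprocess.TimeoutExpired:
            pass  # hung run: treat as a failure
    return ok / trials


# Hypothetical command lines; "-np 4" and "./my_app" are placeholders.
BASELINE = ["mpirun", "-np", "4", "./my_app"]
NO_SM = ["mpirun", "-np", "4", "-mca", "btl", "^sm", "./my_app"]

# Example use:
#   print("with sm:   ", success_rate(BASELINE))
#   print("without sm:", success_rate(NO_SM))
```

Passing the arguments as a list avoids any shell quoting issues with the
"^" character. If the rate without sm is clearly higher than your current
25%, the shared memory backing files on that file system are the likely
culprit.
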
> Thanks very much,
> - Brian
> Brian Dobbins
> Yale Engineering HPC