On May 19, 2009, at 12:13 PM, Ashley Pittman wrote:
> On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote:
>> I'd suspect the filesystem too, except that it's hung up in an MPI
>> call. As I said
>> before, the whole thing is bizarre. It doesn't matter where the
>> executable is,
>> just what CWD is (i.e. I can do mpirun /scratch/exec or mpirun /home/
>> but if it's sitting in /scratch it'll hang). And I've been running
>> other codes both from NFS and from scratch directories for months,
>> and never had a problem.
> That is indeed odd but it shouldn't be too hard to track down, how
> does the failure occur? Presumably when you say you have three
> invocations of the program they communicate via files, is the location
> of these files changing?
The hang is completely repeatable. Every time I run in the
that hangs, the third invocation hangs (i.e. running from the scratch
The three invocations happen like this:
serial code creates a subdirectory, writes input files, calls
"mpirun ...", reads output files
repeat, in a new subdirectory (except that the input files are
configured to do a rather different calculation)
repeat again, in a new subdirectory (back to the first type of
I can try to save the input and output files and recreate the process
by running mpirun myself, rather than via the serial code.
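
A minimal sketch of what such a replay could look like, assuming Python is
available and that each saved step can be written as a command line. The
names run_step and replay, the step1/step2/... directory naming and the
timeout value are hypothetical, not part of the real code, and the actual
mpirun arguments would have to be filled in by hand. It runs each command
with a fresh subdirectory as CWD and reports whether it finished, failed,
or exceeded the timeout:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(cmd, workdir, timeout_s):
    """Run cmd with workdir as its CWD; return 'ok', 'failed', or 'hung'."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    try:
        result = subprocess.run(
            cmd,
            cwd=workdir,               # the CWD is the variable under suspicion
            timeout=timeout_s,         # treat "no exit within timeout_s" as a hang
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def replay(cmds, base_dir, timeout_s):
    """Run each command in order, each in its own subdirectory of base_dir."""
    statuses = []
    for i, cmd in enumerate(cmds, start=1):
        statuses.append(run_step(cmd, Path(base_dir) / f"step{i}", timeout_s))
    return statuses


if __name__ == "__main__":
    # Stand-in commands; replace each with the real "mpirun ..." invocation.
    demo_cmds = [[sys.executable, "-c", "print('step done')"]] * 3
    with tempfile.TemporaryDirectory() as base:
        print(replay(demo_cmds, base, timeout_s=10))
```

Pointing base_dir first at an NFS directory and then at a scratch directory
would show whether the third step hangs without the serial code in the loop.
Note that the timeout only kills the process that was launched directly, so
any leftover ranks may need to be cleaned up by hand.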
> I assume you're certain it's actually hanging and not just failing to
Yes - no output is generated for a very long time, and attaching
with gdb shows that it's stuck in basically one place.
> Finally if you could run it with "--mca btl ^ofed" to rule out the
> stack causing the problem that would be useful. You'd need to check
> syntax here.
I'll try that (or actually the corrected syntax in the next message).
> This isn't so suspicious, if there is a problem with some processes
> common for other processes to continue till the next collective call.
Yeah, I guess so.