Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] CP2K mpi hang
From: Noam Bernstein (noam.bernstein_at_[hidden])
Date: 2009-05-19 12:27:23


On May 19, 2009, at 12:13 PM, Ashley Pittman wrote:

> On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote:
>
>> I'd suspect the filesystem too, except that it's hung up in an MPI
>> call. As I said before, the whole thing is bizarre. It doesn't matter
>> where the executable is, just what CWD is (i.e. I can do
>> mpirun /scratch/exec or mpirun /home/bernstei/exec, but if it's
>> sitting in /scratch it'll hang). And I've been running other codes
>> both from NFS and from scratch directories for months, and never had
>> a problem.
>
> That is indeed odd but it shouldn't be too hard to track down, how often
> does the failure occur? Presumably when you say you have three
> invocations of the program they communicate via files, is the location
> of these files changing?

The hang is completely repeatable. Every time I run in the configuration
that hangs (i.e. running from the scratch directory), the third invocation
hangs.
The three invocations happen like this:
   serial code creates a subdirectory, writes input files, calls
"mpirun ...", reads output files
   repeat, in a new subdirectory (except that the input files are
configured to do a rather different calculation)
   repeat again, in a new subdirectory (back to the first type of
calculation).
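
As a rough illustration of that sequence (not the actual serial driver), the
loop can be sketched in Python as below. Everything here is an assumption
made for the sketch: the function name run_step, the file names input.txt
and output.txt, the step names, and the stand-in command that replaces
"mpirun ...". The timeout is only there so that a hang shows up as a result
instead of blocking the driver forever.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(base_dir, name, input_text, command, timeout_s):
    """Create a subdirectory, write an input file, run `command` with that
    subdirectory as CWD, and return the contents of the output file.

    Returns None if the command does not finish within `timeout_s` seconds.
    File names are hypothetical placeholders, not the real code's files.
    """
    workdir = Path(base_dir) / name
    workdir.mkdir(parents=True)
    (workdir / "input.txt").write_text(input_text)
    try:
        # cwd=workdir is the important part: the job runs *in* the new directory.
        subprocess.run(command, cwd=workdir, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # treated as a hang
    return (workdir / "output.txt").read_text()


if __name__ == "__main__":
    # Stand-in for "mpirun ..." so the sketch runs anywhere:
    # it just copies the input to the output in upper case.
    fake_job = [sys.executable, "-c",
                "open('output.txt','w').write(open('input.txt').read().upper())"]
    steps = [("step1", "type A"), ("step2", "type B"), ("step3", "type A")]
    with tempfile.TemporaryDirectory() as base:
        for name, text in steps:
            result = run_step(base, name, text, fake_job, timeout_s=60)
            print(name, "hung" if result is None else "finished")
```

Pointing base_dir at a scratch directory versus an NFS directory would be the
only change between the configuration that hangs and the one that does not.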

I can try to save the input and output files and recreate the process
by running mpirun myself, rather than via the serial code.

>
> I assume you're certain it's actually hanging and not just failing to
> converge?

Yes - no output is generated for a very long time, and attaching
with gdb shows that it's stuck in basically one place.
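
For anyone wanting to repeat that check, a minimal sketch of taking a
backtrace snapshot of a running process from Python follows. It assumes gdb
is installed and that you have permission to attach to the process; the
function names are made up for this example. The gdb options used (-p,
-batch, -ex) are standard ones.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, prints a backtrace
    for every thread, and then exits (which detaches from the process)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def snapshot(pid):
    """Run gdb once against `pid` and return its text output."""
    done = subprocess.run(gdb_backtrace_command(pid),
                          capture_output=True, text=True)
    return done.stdout
```

Taking two snapshots of the same rank some time apart and comparing them is
one way to tell a hang from slow progress: if the stacks are the same both
times, the process is most likely stuck rather than still computing.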

>
>
> Finally if you could run it with "--mca btl ^ofed" to rule out the ofed
> stack causing the problem that would be useful. You'd need to check the
> syntax here.
>

I'll try that (or actually the corrected syntax in the next message).

>
> This isn't so suspicious, if there is a problem with some processes it's
> common for other processes to continue till the next collective call.

Yeah, I guess so.

Noam