I would appreciate your help on the following:
I'm running a parallel CFD code on the Army Research Lab's MJM Linux
cluster, which uses Open MPI. I've run the same code on other Linux
clusters that use MPICH2 and never ran into this problem.
I'm quite convinced that the bottleneck in my code is its
data-transposition routine, although I have not done any rigorous
profiling to confirm this; it is where 90% of the parallel communication
takes place. The code uses a 3-D rectangular domain that is partitioned
across processors so that each processor stores vertical slabs that are
contiguous in the x-direction but distributed across processors in the
y-direction. When a 2-D Fast Fourier Transform (FFT) needs to be done,
the data are transposed so that the slabs become contiguous in the
y-direction on each processor.
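To make the data movement concrete, here is a small numpy model of the slab transpose (the names and sizes are purely illustrative, not my actual routine, and it assumes Nx and Ny are divisible by P); this is exactly the exchange pattern that an all-to-all performs:

```python
import numpy as np

def transpose_slabs(yslabs):
    """Model of the slab transpose: each of the P entries of `yslabs`
    is one processor's slab of shape (Nx, Ny // P) (contiguous in x);
    the result gives each processor a slab of shape (Nx // P, Ny)
    (contiguous in y)."""
    P = len(yslabs)
    nx = yslabs[0].shape[0] // P
    # Cut each slab into P blocks along x; block i is destined for proc i.
    blocks = [[s[i * nx:(i + 1) * nx, :] for i in range(P)] for s in yslabs]
    # The all-to-all step: proc i gathers block i from every proc j and
    # glues the pieces together along y.
    return [np.concatenate([blocks[j][i] for j in range(P)], axis=1)
            for i in range(P)]

# Tiny check with Nx=8, Ny=4, P=2: reassembling the new slabs along x
# recovers the original global array.
Nx, Ny, P = 8, 4, 2
grid = np.arange(Nx * Ny, dtype=float).reshape(Nx, Ny)
yslabs = [grid[:, r * (Ny // P):(r + 1) * (Ny // P)] for r in range(P)]
xslabs = transpose_slabs(yslabs)
assert np.array_equal(np.concatenate(xslabs, axis=0), grid)
```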
The code normally runs for about 10,000 timesteps, with a large number
of 2-D FFTs performed at each timestep. In the specific case that fails,
the job crashes after ~200 timesteps. For a domain
with resolution of Nx * Ny * Nz points and P processors, during one FFT,
each processor performs P Sends and P Receives of a message of size
(Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such Sends/Receives.
I've focused on a case with P=32 processors and Nx=256, Ny=128, Nz=175,
so each FFT involves 2048 communications. I completely rewrote my
data-transposition routine to use MPI_ALLTOALL instead of explicit
blocking/non-blocking Sends/Receives, in the hope that it is optimized
for data transpositions in this specific MPI implementation.
Unfortunately, my code still crashes with time-out problems like before.
This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL code
worked fine on a smaller cluster here. Note that in the future I would
like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and P=128 or
256 processors, which will involve an order of magnitude more communication.
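To put numbers on the formula above (assuming 8-byte reals, which is my assumption here rather than something stated by the formula):

```python
def fft_comm(Nx, Ny, Nz, P, bytes_per_point=8):
    """Message count and size per 2-D FFT, per the formula above."""
    msgs = 2 * P * P                  # total Sends + Receives: 2*P^2
    pts = (Nx * Ny * Nz) // P         # points per message
    return msgs, pts, pts * bytes_per_point

print(fft_comm(256, 128, 175, 32))    # current case: (2048, 179200, 1433600)
print(fft_comm(512, 256, 533, 256))   # planned case: (131072, 272896, 2183168)
```

So each message in the current case is about 1.4 MB, and the planned case multiplies the message count by 64.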
Note that I ran the job by submitting it to an LSF queue system. I've
attached the script file used for that. I basically enter bsub -x <
script_openmpi at the command line.
When I communicated with a consultant at ARL, he recommended three
specific script files, which I've attached. I believe these enable
control over some of the MCA parameters. I've experimented with values
of btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still have this
problem. I am still in contact with this consultant but thought it would
be good to contact you folks directly.
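For reference, this is how the parameter can be set for a run (the executable name and process count below are just placeholders, not my actual job):

```shell
# Set the MCA timeout parameter on the command line ...
mpirun --mca btl_mvapi_ib_timeout 20 -np 32 ./my_cfd

# ... or via the environment, which the wrapper scripts can pick up:
export OMPI_MCA_btl_mvapi_ib_timeout=20
```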
a) echo $PATH returns:
b) echo $LD_LIBRARY_PATH returns:
I've attached the following files:
1) Gzipped versions of the .out & .err files of the failed job.
2) ompi_info.log: The output of ompi_info -all
3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files provided
to me by the ARL consultant. I store these in my home directory and
experimented with the MCA parameter btl_mvapi_ib_timeout in mpirun.
4) The script file script_openmpi that I use to submit the job.
I am unable to provide you with the config.log file as I cannot find it
in the top level Open MPI directory.
I am also unable to provide details on the cluster's network. I know
they use InfiniBand, and some more detail may be found at:
Some other info:
a) uname -a returns:
Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008
x86_64 x86_64 x86_64 GNU/Linux
b) ulimit -l returns: unlimited
I cannot see a pattern as to which nodes are bad and which are good ...
Note that I found in the mailing-list archives that someone had a
similar problem transposing a matrix with 16 million elements. The only
answer I found in that thread was to increase btl_mvapi_ib_timeout to
14 or 16, which I have already done.
I'm hoping there is a way out of this problem. I need to get my code
running, as I'm under pressure to produce results for the grant that's
paying me.
If you have any feedback I would be hugely grateful.
- application/x-shellscript attachment: mpirun