
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Fw: Re: Open MPI timeout problems.
From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2008-06-19 17:38:45


> I appreciate the feedback. I'm assuming that this upgrade to the
> OpenFabrics driver is something that the system admin of the cluster
> should be concerned with, and not I?
Yes, a driver upgrade requires root permissions, so it is something your
system administrator would need to handle.
Thanks,
Pasha

>
> Thanks,
>
> Peter
>
> Peter Diamessis wrote:
>>
>>
>> --- On Thu, 6/19/08, Pavel Shamis (Pasha) <pasha_at_[hidden]> wrote:
>>
>> From: Pavel Shamis (Pasha) <pasha_at_[hidden]>
>> Subject: Re: [OMPI users] Open MPI timeout problems.
>> To: pjd38_at_[hidden], "Open MPI Users" <users_at_[hidden]>
>> Date: Thursday, June 19, 2008, 5:20 AM
>>
>> Usually a "retry exceeded" error points to some network issue on your
>> cluster. I see from the logs that you still use MVAPI. If I remember
>> correctly, MVAPI includes the IBADM application, which should be able
>> to check and debug the network. BTW, I recommend you update your MVAPI
>> driver to the latest OpenFabrics driver.
>>
>> Peter Diamessis wrote:
>> > Dear folks,
>> >
>> > I would appreciate your help on the following:
>> >
>> > I'm running a parallel CFD code on the Army Research Lab's MJM Linux
>> > cluster, which uses Open MPI. I've run the same code on other Linux
>> > clusters that use MPICH2 and had never run into this problem.
>> >
>> > I'm quite convinced that the bottleneck for my code is its data
>> > transposition routine, although I have not done any rigorous
>> > profiling to check on it. This is where 90% of the parallel
>> > communication takes place. I'm running a CFD code that uses a 3-D
>> > rectangular domain which is partitioned across processors in such a
>> > way that each processor stores vertical slabs that are contiguous in
>> > the x-direction but shared across processors in the y-direction.
>> > When a 2-D Fast Fourier Transform (FFT) needs to be done, the data is
>> > transposed such that the vertical slabs become contiguous in the
>> > y-direction on each processor.
>> >
>> > The code would normally be run for about 10,000 timesteps. In the
>> > specific case that blocks, the job crashes after ~200 timesteps, and
>> > at each timestep a large number of 2-D FFTs is performed. For a
>> > domain with a resolution of Nx * Ny * Nz points and P processors,
>> > during one FFT each processor performs P Sends and P Receives of a
>> > message of size (Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such
>> > Sends/Receives.
>> >
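
(Just to put numbers on this: by the formula above, each processor moves
(Nx*Ny*Nz)/P points per transpose. For the 256 x 128 x 175 grid on 32
processors that you describe next, that is 5,734,400 / 32 = 179,200 points,
or roughly 1.4 MB per processor per transpose assuming 8-byte reals (an
assumption on my part), so every FFT-heavy timestep pushes a substantial
volume of traffic through the fabric.)
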
>> > I've focused on a case using P=32 procs with Nx=256, Ny=128, Nz=175.
>> > You can see that each FFT involves 2048 communications. I completely
>> > rewrote my data transposition routine so that it no longer uses
>> > individual blocking/non-blocking Sends/Receives but instead uses
>> > MPI_ALLTOALL, which I would hope is optimized by the specific MPI
>> > implementation for doing data transpositions. Unfortunately, my code
>> > still crashes with time-out problems like before.
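
For what it's worth, below is a minimal sketch (in C) of the kind of
MPI_Alltoall-based transpose you describe. The routine name, the use of
MPI_DOUBLE, and the assumption that the send buffer is already packed by
destination rank are my own illustrative choices, not taken from your code:

    #include <mpi.h>

    /* Exchange x-contiguous slabs for y-contiguous slabs.
     * nloc is the number of elements each rank owns, i.e. (Nx*Ny*Nz)/P.
     * sendbuf is assumed to be packed so that the block destined for
     * rank r occupies elements [r*nloc/P, (r+1)*nloc/P).             */
    static void transpose_slabs(double *sendbuf, double *recvbuf,
                                int nloc, MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        /* One call sends nloc/nprocs elements to every rank and
         * receives the same amount from every rank.               */
        MPI_Alltoall(sendbuf, nloc / nprocs, MPI_DOUBLE,
                     recvbuf, nloc / nprocs, MPI_DOUBLE, comm);

        /* recvbuf now holds one block from every rank; a purely local
         * re-ordering then makes the data y-contiguous for the FFTs. */
    }

One thing worth checking is whether the local counts divide evenly for
Nz=175; MPI_Alltoall requires equal block sizes on every rank, while
MPI_Alltoallv handles the uneven case.
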
>> >
>> > This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL
>> > code worked fine on a smaller cluster here. Note that in the future I
>> > would like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and
>> > P=128 or 256 procs., which will involve an order of magnitude more
>> > communication.
>> >
>> > Note that I ran the job by submitting it to an LSF queue system. I've
>> > attached the script file used for that. I basically enter "bsub -x <
>> > script_openmpi" at the command line.
>> >
>> > When I communicated with a consultant at ARL, he recommended I use
>> > 3 specific script files, which I've attached. I believe these enable
>> > control over some of the MCA parameters. I've experimented with values
>> > of btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still have this
>> > problem. I am still in contact with this consultant but thought it
>> > would be good to contact you folks directly.
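
In case it helps while you experiment: an MCA parameter can also be set
directly on the mpirun command line or through the environment, without
editing the wrapper scripts, for example (the executable name below is
just a placeholder):

    mpirun --mca btl_mvapi_ib_timeout 24 -np 32 ./your_cfd_code

or

    export OMPI_MCA_btl_mvapi_ib_timeout=24

You can check which value was actually picked up with
"ompi_info --param btl mvapi". If I remember correctly, this timeout is an
exponent rather than a linear value (the actual wait is roughly
4.096 microseconds * 2^value), so going from 14 to 24 already increases the
wait by a factor of about 1000.
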
>> >
>> > Note:
>> > a) echo $PATH returns:
>> > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
>> > /opt/compiler/pgi/linux86-64/6.2/bin:
>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/bin:
>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/etc:
>> > /usr/cta/modules/3.1.6/bin:
>> > /usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:
>> > .:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin
>> >
>> > b) echo $LD_LIBRARY_PATH returns:
>> > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
>> > /opt/compiler/pgi/linux86-64/6.2/lib:
>> > /opt/compiler/pgi/linux86-64/6.2/libso:
>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/lib
>> >
>> > I've attached the following files:
>> > 1) Gzipped versions of the .out & .err files of the failed job.
>> > 2) ompi_info.log: The output of ompi_info -all
>> > 3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files
>> > provided to me by the ARL consultant. I store these in my home
>> > directory and experimented with the MCA parameter btl_mvapi_ib_timeout
>> > in mpirun.
>> > 4) The script file script_openmpi that I use to submit the job.
>> >
>> > I am unable to provide you with the config.log file as I cannot find
>> > it in the top level Open MPI directory.
>> >
>> > I am also unable to provide you with details on the network of the
>> > specific cluster that I'm running on. I know they use InfiniBand, and
>> > some more detail may be found at:
>> >
>> > http://www.arl.hpc.mil/Systems/mjm.html
>> >
>> > Some other info:
>> > a) uname -a returns:
>> > Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008
>> > x86_64 x86_64 x86_64 GNU/Linux
>> >
>> > b) ulimit -l returns: unlimited
>> >
>> > I cannot see a pattern as to which nodes are bad and which are
>> > good ...
>> >
>> > Note that I found in the mail archives that someone had a similar
>> > problem in transposing a matrix with 16 million elements. The only
>> > answer I found in the thread was to increase the value of
>> > btl_mvapi_ib_timeout to 14 or 16, something I've done already.
>> >
>> > I'm hoping that there must be a way out of this problem. I need to
>> > get my code running as I'm under pressure to produce results for a
>> > grant that's paying me.
>> >
>> > If you have any feedback I would be hugely grateful.
>> >
>> > Sincerely,
>> >
>> > Peter Diamessis
>> > Cornell University
>> >
>> >
>> ------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>