Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fw: Re: Open MPI timeout problems.
From: Pavel Shamis (Pasha) (pasha_at_[hidden])
Date: 2008-06-19 17:38:45


> I appreciate the feedback. I'm assuming that this upgrade to the
> OpenFabrics driver is something that the system administrator of the
> cluster should be concerned with and not I?
A driver upgrade will require root permissions.
Thanks,
Pasha

>
> Thanks,
>
> Peter
>
> Peter Diamessis wrote:
>>
>>
>> --- On Thu, 6/19/08, Pavel Shamis (Pasha) <pasha_at_[hidden]> wrote:
>>
>> From: Pavel Shamis (Pasha) <pasha_at_[hidden]>
>> Subject: Re: [OMPI users] Open MPI timeout problems.
>> To: pjd38_at_[hidden], "Open MPI Users" <users_at_[hidden]>
>> Date: Thursday, June 19, 2008, 5:20 AM
>>
>> Usually a retry-exceeded error points to some network issue on your
>> cluster. I see from the logs that you still use MVAPI. If I remember
>> correctly, MVAPI includes the IBADM application, which should be able
>> to check and debug the network. BTW, I recommend you update your
>> MVAPI driver to the latest OpenFabrics driver.
>>
>> Peter Diamessis wrote:
>> > Dear folks,
>> >
>> > I would appreciate your help on the following:
>> >
>> > I'm running a parallel CFD code on the Army Research Lab's MJM Linux
>> > cluster, which uses Open MPI. I've run the same code on other Linux
>> > clusters that use MPICH2 and had never run into this problem.
>> >
>> > I'm quite convinced that the bottleneck for my code is this data
>> > transposition routine, although I have not done any rigorous
>> > profiling to check on it. This is where 90% of the parallel
>> > communication takes place. I'm running a CFD code that uses a 3-D
>> > rectangular domain which is partitioned across processors in such a
>> > way that each processor stores vertical slabs that are contiguous
>> > in the x-direction but shared across processors in the y-dir. When
>> > a 2-D Fast Fourier Transform (FFT) needs to be done, data is
>> > transposed such that the vertical slabs are now contiguous in the
>> > y-dir. in each processor.
>> >
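As an illustration of the layout being described, here is a minimal sketch
(not the actual CFD code; the grid sizes and names below are placeholders,
and Nx and Ny are assumed to divide evenly by the number of MPI ranks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Placeholder resolution, not the resolution from this thread. */
    const int Nx = 64, Ny = 32, Nz = 16;
    int rank, P;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Before the transpose: every rank holds the full x range and a
     * 1/P slice of the y range (x-contiguous vertical slabs).        */
    const int ny_loc = Ny / P;

    /* After the transpose: every rank holds the full y range and a
     * 1/P slice of the x range, so the y-direction part of the 2-D
     * FFT needs no further communication.                            */
    const int nx_loc = Nx / P;

    printf("rank %d: y rows [%d,%d) before, x columns [%d,%d) after (Nz=%d)\n",
           rank, rank * ny_loc, (rank + 1) * ny_loc,
           rank * nx_loc, (rank + 1) * nx_loc, Nz);

    MPI_Finalize();
    return 0;
}
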
>> > The code would normally be run for about 10,000 timesteps. In the
>> > specific case which blocks, the job crashes after ~200 timesteps and
>> > at each timestep a large number of 2-D FFTs are performed. For a
>> > domain with resolution of Nx * Ny * Nz points and P processors,
>> > during one FFT, each processor performs P Sends and P Receives of a
>> > message of size (Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such
>> > Sends/Receives.
>> >
>> > I've focused on a case using P=32 procs with Nx=256, Ny=128, Nz=175.
>> > You can see that each FFT involves 2048 communications. I totally
>> > rewrote my data transposition routine to no longer use specific
>> > blocking/non-blocking Sends/Receives but to use MPI_ALLTOALL which I
>> > would hope is optimized for the specific MPI Implementation to do
>> > data transpositions. Unfortunately, my code still crashes with
>> > time-out problems like before.
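
A minimal sketch of what such an MPI_ALLTOALL-based x-to-y transpose can
look like, assuming double-precision data, that Nx and Ny divide evenly by
P, and an explicit pack/unpack step; the routine name, buffer layout, and
index ordering are illustrative, not taken from the code discussed in this
thread:

#include <mpi.h>
#include <stdlib.h>

/* x_slab: [Nz][Ny/P][Nx]  (full x range, 1/P of y)                        */
/* y_slab: [Nz][Ny][Nx/P]  (full y range, 1/P of x)                        */
void transpose_x_to_y(const double *x_slab, double *y_slab,
                      int Nx, int Ny, int Nz, MPI_Comm comm)
{
    int P;
    MPI_Comm_size(comm, &P);

    const int nx_loc = Nx / P, ny_loc = Ny / P;
    const int blk = Nz * ny_loc * nx_loc;      /* doubles sent to each peer */
    double *sendbuf = malloc((size_t)P * blk * sizeof(double));
    double *recvbuf = malloc((size_t)P * blk * sizeof(double));

    /* Pack: block r holds my local y rows restricted to rank r's x range. */
    for (int r = 0; r < P; ++r)
        for (int z = 0; z < Nz; ++z)
            for (int j = 0; j < ny_loc; ++j)
                for (int ii = 0; ii < nx_loc; ++ii)
                    sendbuf[(size_t)r * blk + ((size_t)z * ny_loc + j) * nx_loc + ii] =
                        x_slab[((size_t)z * ny_loc + j) * Nx + r * nx_loc + ii];

    /* One collective call replaces the explicit P Sends and P Receives.   */
    MPI_Alltoall(sendbuf, blk, MPI_DOUBLE, recvbuf, blk, MPI_DOUBLE, comm);

    /* Unpack: block s carries rank s's y rows for my x range.             */
    for (int s = 0; s < P; ++s)
        for (int z = 0; z < Nz; ++z)
            for (int j = 0; j < ny_loc; ++j)
                for (int ii = 0; ii < nx_loc; ++ii)
                    y_slab[((size_t)z * Ny + (size_t)s * ny_loc + j) * nx_loc + ii] =
                        recvbuf[(size_t)s * blk + ((size_t)z * ny_loc + j) * nx_loc + ii];

    free(sendbuf);
    free(recvbuf);
}

With the P=32, (Nx,Ny,Nz)=(256,128,175) case above, each rank in this
sketch would exchange Nx*Ny*Nz/P = 179,200 doubles (roughly 1.4 MB) per
transpose, in 32 blocks of 5,600 doubles each; whether that matches the
message sizes of the original point-to-point version depends on how that
version was written.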
>> >
>> > This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL
>> > code worked fine on a smaller cluster here. Note that in the future
>> > I would like to work with resolutions of (Nx,Ny,Nz)=(512,256,533)
>> > and P=128 or 256 procs. which will involve an order of magnitude
>> > more communication.
>> >
>> > Note that I ran the job by submitting it to an LSF queue system.
>> > I've attached the script file used for that. I basically enter
>> > bsub -x < script_openmpi at the command line.
>> >
>> > When I communicated with a consultant at ARL, he recommended I use
>> > 3 specific script files which I've attached. I believe these enable
>> > control over some of the MCA parameters. I've experimented with
>> > values of btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still
>> > have this problem. I am still in contact with this consultant but
>> > thought it would be good to contact you folks directly.
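
For reference, MCA parameters like this can also be set directly at launch
rather than through wrapper scripts, e.g. with something like
mpirun --mca btl_mvapi_ib_timeout 20 -np 32 ./my_code (the executable name
and process count here are placeholders), or exported in the environment as
OMPI_MCA_btl_mvapi_ib_timeout=20. Keep in mind that the value is an
exponent: the InfiniBand ACK timeout it controls is roughly
4.096 microseconds * 2^value, so each increment doubles how long the HCA
waits before retrying.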
>> >
>> > Note:
>> > a) echo $PATH returns:
>> > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
>> > /opt/compiler/pgi/linux86-64/6.2/bin:
>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/bin:
>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/etc:
>> > /usr/cta/modules/3.1.6/bin:
>> > /usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:
>> > .:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin
>> >
>> > b) echo $LD_LIBRARY_PATH returns:
>> > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
>> > /opt/compiler/pgi/linux86-64/6.2/lib:
>> > /opt/compiler/pgi/linux86-64/6.2/libso:
>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/lib
>> >
>> > I've attached the following files:
>> > 1) Gzipped versions of the .out & .err files of the failed job.
>> > 2) ompi_info.log: The output of ompi_info -all
>> > 3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files
>> > provided to me by the ARL consultant. I store these in my home
>> > directory and experimented with the MCA parameter
>> > btl_mvapi_ib_timeout in mpirun.
>> > 4) The script file script_openmpi that I use to submit the job.
>> >
>> > I am unable to provide you with the config.log file as I cannot
>> > find it in the top level Open MPI directory.
>> >
>> > I am also unable to provide you with details on the specific
>> > cluster that I'm running on, in terms of the network. I know they
>> > use InfiniBand and some more detail may be found at:
>> >
>> > http://www.arl.hpc.mil/Systems/mjm.html
>> >
>> > Some other info:
>> > a) uname -a returns:
>> > Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008
>> > x86_64 x86_64 x86_64 GNU/Linux
>> >
>> > b) ulimit -l returns: unlimited
>> >
>> > I cannot see a pattern as to which nodes are bad and which are
>> > good ...
>> >
>> >
>> > Note that I found in the mail archives that someone had a similar
>> > problem in transposing a matrix with 16 million elements. The only
>> > answer I found in the thread was to increase the value of
>> > btl_mvapi_ib_timeout to 14 or 16, something I've done already.
>> >
>> > I'm hoping that there must be a way out of this problem. I need to
>> > get my code running as I'm under pressure to produce results for a
>> > grant that's paying me.
>> >
>> > If you have any feedback I would be hugely grateful.
>> >
>> > Sincerely,
>> >
>> > Peter Diamessis
>> > Cornell University
>> >
>> >
>> >
>> ------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>