Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Fw: Re: Open MPI timeout problems.
From: Peter Diamessis (pjd38_at_[hidden])
Date: 2008-06-20 10:46:16


Hi Jeff,

I really appreciate the insight. I will pass your thoughts on to our
system admins. Hopefully, they can begin exploring the installation of a
more modern stack.

Sincerely,

Peter

Jeff Squyres wrote:
> To clarify what Pasha said: AFAIK, all IB vendors have deprecated the
> use of their mVAPI-based driver stacks in HPC environments (I know
> that Cisco and Mellanox have; I'm not 100% sure about others). We all
> encourage upgrading to the OFED stack (currently at v1.3.1) if
> possible; it's much newer, more modern, and is where all development
> work is occurring these days. Indeed, OMPI is dropping support for
> the older mVAPI-based driver stacks in our upcoming v1.3 release.
>
> Upgrading to a whole new driver stack is not something that can be
> undertaken lightly, though -- it will likely take time for the
> sysadmins to evaluate, learn, etc.
>
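(A side note for readers checking their own systems: one quick,
non-authoritative way to see which stack a node is running is shown below.
This is a sketch assuming an OFED-style install; these commands ship with
OFED/libibverbs and will simply be absent on a pure mVAPI system.)

    ofed_info | head -1   # prints the OFED release string, if OFED is installed
    ibv_devinfo           # lists HCAs through the OFED verbs library
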
>
> On Jun 19, 2008, at 5:38 PM, Pavel Shamis (Pasha) wrote:
>
>>
>>> I appreciate the feedback. I'm assuming that this upgrade to the
>>> OpenFabrics driver is something that the system admin of the cluster
>>> should be concerned with, and not I?
>> A driver upgrade will require root permissions.
>> Thanks,
>> Pasha
>>
>>>
>>> Thanks,
>>>
>>> Peter
>>>
>>> Peter Diamessis wrote:
>>>>
>>>>
>>>> --- On Thu, 6/19/08, Pavel Shamis (Pasha)
>>>> <pasha_at_[hidden]> wrote:
>>>>
>>>> From: Pavel Shamis (Pasha) <pasha_at_[hidden]>
>>>> Subject: Re: [OMPI users] Open MPI timeout problems.
>>>> To: pjd38_at_[hidden], "Open MPI Users" <users_at_[hidden]>
>>>> Date: Thursday, June 19, 2008, 5:20 AM
>>>>
>>>> Usually "retry exceeded" errors point to some network issue on your
>>>> cluster. I see from the logs that you still use MVAPI. If I remember
>>>> correctly, MVAPI includes the IBADM application, which should be able
>>>> to check and debug the network. BTW, I recommend updating your MVAPI
>>>> driver to the latest OpenFabrics driver.
>>>>
>>>> Peter Diamessis wrote:
>>>> > Dear folks,
>>>> >
>>>> > I would appreciate your help on the following:
>>>> >
>>>> > I'm running a parallel CFD code on the Army Research Lab's MJM Linux
>>>> > cluster, which uses Open MPI. I've run the same code on other Linux
>>>> > clusters that use MPICH2 and had never run into this problem.
>>>> >
>>>> > I'm quite convinced that the bottleneck for my code is this data
>>>> > transposition routine, although I have not done any rigorous profiling
>>>> > to check on it. This is where 90% of the parallel communication takes
>>>> > place. I'm running a CFD code that uses a 3-D rectangular domain which
>>>> > is partitioned across processors in such a way that each processor
>>>> > stores vertical slabs that are contiguous in the x-direction but shared
>>>> > across processors in the y-direction. When a 2-D Fast Fourier Transform
>>>> > (FFT) needs to be done, data is transposed such that the vertical slabs
>>>> > are now contiguous in the y-direction on each processor.
>>>> >
>>>> > The code would normally be run for about 10,000 timesteps. In the
>>>> > specific case which blocks, the job crashes after ~200 timesteps, and
>>>> > at each timestep a large number of 2-D FFTs are performed. For a domain
>>>> > with a resolution of Nx * Ny * Nz points and P processors, during one
>>>> > FFT each processor performs P Sends and P Receives of a message of size
>>>> > (Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such Sends/Receives.
>>>> >
>>>> > I've focused on a case using P=32 procs with Nx=256, Ny=128, Nz=175.
>>>> > You can see that each FFT involves 2048 communications. I totally
>>>> > rewrote my data transposition routine to no longer use specific
>>>> > blocking/non-blocking Sends/Receives but to use MPI_ALLTOALL, which I
>>>> > would hope is optimized by the specific MPI implementation to do data
>>>> > transpositions. Unfortunately, my code still crashes with time-out
>>>> > problems like before.
>>>> >
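(For reference, below is a minimal C sketch of the MPI_ALLTOALL transpose
pattern described above. The buffer packing is elided, and the variable
names, grid sizes, and equal-slab assumption are illustrative only; this is
not Peter's actual code.)

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        /* The P=32 test case from the mail; Nx and Ny must be divisible
           by P for this simplified sketch. */
        const int Nx = 256, Ny = 128, Nz = 175;

        /* Each rank exchanges one equal block with every other rank;
           each block holds (Nx/P)*(Ny/P)*Nz points. */
        const int block = (Nx / P) * (Ny / P) * Nz;

        double *sendbuf = calloc((size_t)P * block, sizeof *sendbuf);
        double *recvbuf = calloc((size_t)P * block, sizeof *recvbuf);

        /* ... pack sendbuf so the block destined for rank r occupies
           sendbuf[r*block] .. sendbuf[(r+1)*block - 1] ... */

        /* One collective replaces the individually posted point-to-point
           Sends/Receives, letting the library schedule the exchange. */
        MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                     recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

        /* ... unpack recvbuf into the y-contiguous slab layout ... */

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

(The point of the collective is that the library sees the whole exchange at
once and can schedule it, instead of the 2*P^2 individually posted
Sends/Receives described above.)
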
>>>> > This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL
>>>> > code worked fine on a smaller cluster here. Note that in the future I
>>>> > would like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and
>>>> > P=128 or 256 procs., which will involve an order of magnitude more
>>>> > communication.
>>>> >
>>>> > Note that I ran the job by submitting it to an LSF queue system. I've
>>>> > attached the script file used for that. I basically enter bsub -x <
>>>> > script_openmpi at the command line.
>>>> >
>>>> > When I communicated with a consultant at ARL, he recommended I use
>>>> > 3 specific script files, which I've attached. I believe these enable
>>>> > control over some of the MCA parameters. I've experimented with values
>>>> > of btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30, and I still have this
>>>> > problem. I am still in contact with this consultant but thought it
>>>> > would be good to contact you folks directly.
>>>> >
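(A side note on mechanics, for readers tuning the same knob: Open MPI MCA
parameters can be set on the mpirun command line or through the environment,
so editing the wrapper scripts is not strictly required. The launch line
below is a placeholder, not the actual ARL invocation. Note also that the IB
timeout value is an exponent: the actual wait before a retry is roughly
4.096 us * 2^value, so each increment doubles it.)

    mpirun --mca btl_mvapi_ib_timeout 20 -np 32 ./cfd_code

    # equivalent via the environment, which batch scripts inherit:
    export OMPI_MCA_btl_mvapi_ib_timeout=20
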
>>>> > Note:
>>>> > a) echo $PATH returns:
>>>> >
>>>> > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
>>>> > /opt/compiler/pgi/linux86-64/6.2/bin:
>>>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/bin:
>>>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/etc:
>>>> > /usr/cta/modules/3.1.6/bin:
>>>> > /usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:
>>>> > .:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin
>>>> >
>>>> > b) echo $LD_LIBRARY_PATH returns:
>>>> > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
>>>> > /opt/compiler/pgi/linux86-64/6.2/lib:
>>>> > /opt/compiler/pgi/linux86-64/6.2/libso:
>>>> > /usr/lsf/6.2/linux2.6-glibc2.3-ia32e/lib
>>>> >
>>>> > I've attached the following files:
>>>> > 1) Gzipped versions of the .out & .err files of the failed job.
>>>> > 2) ompi_info.log: The output of ompi_info -all
>>>> > 3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files
>>>> > provided to me by the ARL consultant. I store these in my home
>>>> > directory and experimented with the MCA parameter btl_mvapi_ib_timeout
>>>> > in mpirun.
>>>> > 4) The script file script_openmpi that I use to submit the job.
>>>> >
>>>> > I am unable to provide you with the config.log file, as I cannot find
>>>> > it in the top-level Open MPI directory.
>>>> >
>>>> > I am also unable to provide you with details on the network of the
>>>> > specific cluster that I'm running on. I know they use InfiniBand, and
>>>> > some more detail may be found at:
>>>> >
>>>> > http://www.arl.hpc.mil/Systems/mjm.html
>>>> >
>>>> > Some other info:
>>>> > a) uname -a returns:
>>>> >
>>>> > Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008
>>>> > x86_64 x86_64 x86_64 GNU/Linux
>>>> >
>>>> > b) ulimit -l returns: unlimited
>>>> >
>>>> > I cannot see a pattern as to which nodes are bad and which are good ...
>>>> >
>>>> >
>>>> > Note that I found in the mail archives that someone had a similar
>>>> > problem in transposing a matrix with 16 million elements. The only
>>>> > answer I found in the thread was to increase the value of
>>>> > btl_mvapi_ib_timeout to 14 or 16, something I've done already.
>>>> >
>>>> > I'm hoping that there must be a way out of this problem. I need to
>>>> > get my code running, as I'm under pressure to produce results for a
>>>> > grant that's paying me.
>>>> >
>>>> > If you have any feedback I would be hugely grateful.
>>>> >
>>>> > Sincerely,
>>>> >
>>>> > Peter Diamessis
>>>> > Cornell University
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > users mailing list
>>>> > users_at_[hidden]
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

-- 
 
-------------------------------------------------------------
Peter Diamessis
Assistant Professor
Environmental Fluid Mechanics & Hydrology
School of Civil and Environmental Engineering
Cornell University
Ithaca, NY 14853
Phone: (607)-255-1719 --- Fax: (607)-255-9004
pjd38_at_[hidden]
http://www.cee.cornell.edu/faculty/pjd38