Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] machines swapping in running job[Scanned]
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-05 10:47:47


Arif --

Sorry for the delay in replying.

Believe it or not, almost this exact issue just came up with the IBM
Benchmark Center; they were using Open MPI with MPIRandomAccess and
were running out of memory. We didn't get a full set of data or run
all the experiments we wanted. Oddly, the problem seemed to happen
most often with the Intel compilers (preliminary tests showed that we
couldn't replicate it with the gcc compiler at the same problem size).

However, the IBM Benchmark Center engineers were able to get
successful runs by setting the btl_openib_free_list_max MCA
parameter. This parameter essentially limits how much space the
lowest-level IB driver in OMPI uses for fragment lists (it's actually
fairly complex as to what it exactly does and how it helps in this
situation -- insert "waving hands" image here...). This parameter
defaults to "infinite". Setting it to a finite value can allow
MPIRandomAccess to complete; I believe that the IBC engineers used
values of 2000 and 4000 for their systems.
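
As a minimal sketch (not the exact setup the IBC engineers used), the
limit can be set in an MCA parameter file such as
$HOME/.openmpi/mca-params.conf, in the same key=value form as the
settings quoted below. The value 2000 is simply one of the two values
mentioned above; the right number for a given system is an assumption
to be tested, not a recommendation:

```
# Cap the openib BTL free lists instead of leaving them unbounded.
# 2000 is only a starting point; 4000 was also reported to work.
btl_openib_free_list_max = 2000
```

The same setting can be applied to a single run on the command line
with "mpirun --mca btl_openib_free_list_max 2000 ...".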

On Apr 22, 2008, at 12:10 PM, Arif Ali wrote:

> Hi list,
>
> I had a similar problem last year with IMB, when the job would just
> hang on a PowerPC cluster, for which Jeff Squyres gave me many
> pointers on parameters to change to fix the problem. Now, with
> another cluster that I am building, the IMB job hangs in the same
> place and the machines in the cluster also start swapping at the time
> of the hang. Following what Jeff suggested, I have tried the
> following MCA parameters:
>
> btl_openib_flags=1
> btl_openib_ib_timeout=20
> mpool_base_verbose=1
> mpool_base_use_mem_hooks=1
> btl_openib_eager_limit=3072
> #btl_openib_eager_limit=4096
> btl_openib_max_send_size=12288
>
> After setting these parameters, the machines still swapped, but a
> lot less than before, and the job got a lot further and ran to
> completion. Are there any further suggestions on parameters that can
> be tweaked to stop these machines from swapping?
>
> I am also having the same swapping issue when running the HPCC
> benchmark: when it reaches MPIRandomAccess, all machines swap until
> we can no longer access them, and we then have to reboot the
> machines.
>
> OS: SLES 10
> Kernel: 2.6.16.46-0.12-smp
> OFED release: 1.3
> openmpi: 1.2.5 and 1.2.6 using btl openib
> Switch: TopSpin
> SM: on TopSpin switch
> Ulimit has been set to unlimited as suggested in the FAQ
>
> One thing to note: both jobs run with no problems using TCP.
>
>
> regards,
> --
>
> Arif Ali
> Software Engineer
> OCF plc
>
> Mobile: +44 (0)7970 148 122
> DDI: +44 (0)114 257 2240
> Office: +44 (0)114 257 2200
> Fax: +44 (0)114 257 0022
> Email: aali_at_[hidden]
> Web: http://www.ocf.co.uk
>
> Support Phone: +44 (0)845 702 3829
> Support E-mail: support_at_[hidden]
>
> Skype: arif_ali80
> MSN: aali_at_[hidden]

-- 
Jeff Squyres
Cisco Systems