
Open MPI Development Mailing List Archives


From: Ralph Castain (rhc_at_[hidden])
Date: 2007-06-10 18:04:02


Hi Markus

There are two MCA params that can help you, I believe:

1. You can set the maximum size of the shared memory file with

-mca mpool_sm_max_size xxx

where xxx is the maximum size of the memory file you want, expressed in
bytes. The default value I see is 512 MBytes.
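
For example (the number here is just illustrative), capping the file at
256 MBytes would be

-mca mpool_sm_max_size 268435456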

2. You can set the per-peer size of the file, again in bytes:

-mca mpool_sm_per_peer_size xxx

This will allocate a file that is xxx * num_procs_on_the_node on each node,
up to the maximum file size (either the 512MB default or whatever you
specified using the previous param). This defaults to 32MBytes/proc.
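
As a rough cross-check against the 16-process case you measured: with the
defaults, 16 procs * 32 MBytes/proc = 512 MBytes, i.e. exactly the default
maximum. If each local process then maps that whole file (which is my
understanding of how the sm mpool works) and the gridengine sums the
virtual memory over all processes, that alone accounts for roughly
16 * 512 MBytes = 8 GBytes, plus 16 * 50 MBytes for your arrays, which is
quite close to the ~8.5 GBytes you reported.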

I see that there is also a minimum (total, not per-proc) file size that
defaults to 128MBytes. If that is still too large, you can adjust it using

-mca mpool_sm_min_size yyy
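
Putting it together, something along these lines (the sizes are just
illustrative, in bytes) should shrink the shared memory footprint
considerably:

mpirun -mca mpool_sm_per_peer_size 8388608 \
       -mca mpool_sm_min_size 33554432 \
       -mca mpool_sm_max_size 134217728 \
       -np 16 ./my_test_program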

Hope that helps
Ralph

On 6/10/07 2:55 PM, "Markus Daene" <markus.daene_at_[hidden]> wrote:

> Dear all,
>
> I hope I am on the right mailing list for my problem.
> I am trying to run Open MPI with gridengine (6.0u10, 6.1). For that I
> compiled Open MPI (1.2.2), which has the gridengine support included;
> I have checked it with ompi_info.
> In principle, Open MPI runs well.
> The gridengine is configured such that the user has to specify the
> memory consumption via the h_vmem option. I then noticed that with a
> larger number of processes the job is killed by the gridengine for
> taking too much memory.
> To take a closer look at that, I wrote a small and simple (Fortran) MPI
> program which has just an MPI_Init and a (static) array, in my case of
> 50 MB; the program then goes into an (infinite) loop, because it takes
> some time until the gridengine reports the maxvmem.
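>
> (Just to illustrate, a minimal sketch of such a test program, not the
> exact code, would be something like:)
>
>   program memtest
>     implicit none
>     include 'mpif.h'
>     integer :: ierr
>     real(8), dimension(6250000) :: a   ! ~50 MB static array
>     call MPI_Init(ierr)
>     a = 1.0d0                          ! touch the array
>     do                                 ! spin so maxvmem gets sampled
>     end do
>     call MPI_Finalize(ierr)
>   end program memtest
>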
> I found that if the processes all run on different nodes, there is only
> an offset per process, i.e. at most a linear scaling. But it becomes
> worse when the processes run on one node. There the scaling with the
> offset (about 30 MB per process in my case) seems to be quadratic.
> I made a list of the virtual memory reported by the gridengine,
> running on a 16-processor node:
>
> #procs    virt. mem [MB]
>     1          182
>     2          468
>     3          825
>     4         1065
>     5         1001
>     6         1378
>     7         1817
>     8         2303
>    12         4927
>    16         8559
>
> The pure program should need N*50 MB; for 16 processes that is only
> 800 MB, but it takes 10 times more, >7 GB!
> Of course the gridengine will kill the job for taking too much virtual
> memory if this overhead is not taken into account. The memory
> consumption is not related to the gridengine; it is the same if I run
> from the command line.
> I guess it might be related to the 'sm' component of the btl.
> Is it possible to avoid the quadratic scaling?
> Of course I could use only the mvapi/tcp components, like
> mpirun --mca btl mvapi -np 16 ./my_test_program
> In this case the virtual memory is fine, but that is not what one wants
> on an SMP node.
>
>
> Then it becomes even worse:
> Open MPI nicely reports the (max./actual) used virtual memory to the
> gridengine as the sum over all processes. This value is then compared
> with the one the user has specified via the h_vmem option, but the
> gridengine takes this value per process when allocating the job (which
> works) and does not multiply it by the number of processes. Maybe one
> should report this to the gridengine mailing list, but it could also be
> related to the Open MPI interface.
>
> The last thing I noticed:
> It seems that if the h_vmem option for gridengine jobs is specified as
> '2.0G', my test job was immediately killed, but when I specify '2000M'
> (which is obviously less) it works. The gridengine always puts the job
> on the correct node as requested, but I think there might be a problem
> in the Open MPI interface.
>
>
> It would be nice if someone could give some hints on how to avoid the
> quadratic scaling, or maybe think about whether it is really necessary
> in Open MPI.
>
>
> Thanks.
> Markus Daene
>
>
>
>
> my compiling options:
> ./configure --prefix=/not_important --enable-static
> --with-f90-size=medium --with-f90-max-array-dim=7
> --with-mpi-param-check=always --enable-cxx-exceptions --with-mvapi
> --enable-mca-no-build=btl-tcp
>
> ompi_info output:
> Open MPI: 1.2.2
> Open MPI SVN revision: r14613
> Open RTE: 1.2.2
> Open RTE SVN revision: r14613
> OPAL: 1.2.2
> OPAL SVN revision: r14613
> Prefix: /usrurz/openmpi/1.2.2/pathscale_3.0
> Configured architecture: x86_64-unknown-linux-gnu
> Configured by: root
> Configured on: Mon Jun 4 16:04:38 CEST 2007
> Configure host: GE1N01
> Built by: root
> Built on: Mon Jun 4 16:09:37 CEST 2007
> Built host: GE1N01
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: pathcc
> C compiler absolute: /usrurz/pathscale/bin/pathcc
> C++ compiler: pathCC
> C++ compiler absolute: /usrurz/pathscale/bin/pathCC
> Fortran77 compiler: pathf90
> Fortran77 compiler abs: /usrurz/pathscale/bin/pathf90
> Fortran90 compiler: pathf90
> Fortran90 compiler abs: /usrurz/pathscale/bin/pathf90
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: yes
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: always
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: yes
> mpirun default --prefix: no
> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.2)
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.2)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.2)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.2)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.2)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.2)
> MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.2)
> MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.2)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.2)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.2.2)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.2)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.2)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.2.2)
> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.2)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.2)
> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.2)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.2)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.2)
> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.2)
> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.2)
> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.2)
> MCA btl: mvapi (MCA v1.0, API v1.0.1, Component v1.2.2)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.2)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.2)
> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.2)
> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.2)
> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.2)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.2)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.2)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.2)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.2)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.2)
> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.2)
> MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.2)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.2)
> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.2)
> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.2)
> MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.2)
> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.2)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.2)
> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.2)
> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.2)
> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.2)
> MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.2)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.2)
>
> ----------------------------------------------------------
> Markus Daene
> Martin Luther University Halle-Wittenberg
> Naturwissenschaftliche Fakultaet II
> Institute of Physics
> Von Seckendorff-Platz 1 (room 1.28)
> 06120 Halle
> Germany
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel