
Open MPI Development Mailing List Archives


From: Markus Daene (markus.daene_at_[hidden])
Date: 2007-06-17 07:32:13


Hi Jeff,

thanks for your comments.

1. I will report this to the GE mailing list.

2. We have a cluster of 18 nodes with 16 cores each (8x dual-core
Opteron). So we plan to run between 1 and 128 processes in total,
16 per node. Of course, if the sm component allocates 512MB x 16
on one node, that is 8GB just for MPI, which is too much. I
reduced the size as Ralph suggested.
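As a rough sanity check on that arithmetic, here is a small Python sketch.
The sizing rule (per-peer size times local processes, clamped between a
minimum and a maximum file size) is assumed from Ralph's description below,
not taken from the Open MPI source:

```python
MB = 1024 * 1024

def sm_file_size(nprocs, per_peer=32 * MB, min_size=128 * MB, max_size=512 * MB):
    """Per-node sm backing file: per-peer size times local processes,
    clamped between the min and max sizes Ralph quoted.
    (Assumed rule -- not copied from the Open MPI source.)"""
    return min(max_size, max(min_size, per_peer * nprocs))

nprocs = 16
size = sm_file_size(nprocs)
print(size // MB)           # 512: with 16 procs the file hits the default cap
# Every local process maps the same file, so per-process accounting
# (as the gridengine does it) can count it up to nprocs times:
print(nprocs * size // MB)  # 8192, i.e. the 8GB figure above
```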

3. I think it will not be possible to use the OpenFabrics kernel/user
stack. The machine was installed by SUN, and it seems they did not use
the OpenFabrics stack. I guess it will be a hard discussion to change
this, and we cannot do it on our own; we would eventually lose the support.

4. I will try whether the DMA engine works better than the sm
component.
We will run 16 processes per node with different message sizes. We are
using 2 HCAs on each node (bonded).

Markus
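P.S.: the numbers in the table from my original mail (quoted below) are
consistent with a simple quadratic model. The sketch assumes
total vmem(N) ~ 50*N + c*N^2 MB, i.e. 50MB of static array per process
plus an assumed per-process-pair sm cost; the model is my guess, not
anything documented by Open MPI:

```python
# Virtual memory reported by the gridengine (MB), copied from the
# table in the original mail below.
measured = {1: 182, 2: 468, 4: 1065, 8: 2303, 12: 4927, 16: 8559}

def quadratic_coefficient(n, vmem_mb, per_proc_mb=50):
    """Back out c in vmem(n) ~ per_proc_mb*n + c*n**2 (all in MB)."""
    return (vmem_mb - per_proc_mb * n) / n ** 2

for n, v in sorted(measured.items()):
    print(n, round(quadratic_coefficient(n, v), 1))
# For the larger runs c settles around ~30 MB, matching the ~30M
# per-process offset mentioned in the mail.
```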

Jeff Squyres wrote:
> In addition to what Ralph said, I have the following random comments:
>
> 1. You'll have to ask on the GE mailing lists about the GE issues
> (2gb vs. 2000mb, etc.); I doubt we'll be of much help here on this list.
>
> 2. Do you have a very large SMP machine (i.e., 16 cores or more)?
> More specifically, how many MPI processes do you plan to run at once
> on a host?
>
> 3. Unrelated to the SMP issue, I see that you are using the
> InfiniBand Mellanox VAPI interface (mvapi BTL). Is there any chance
> that you can upgrade to the newer OpenFabrics kernel/user stack? All
> the IB vendors support it for their HPC customers. FWIW: all Open
> MPI InfiniBand work is being done in support of OpenFabrics; the
> "mvapi" BTL is only maintained for backward compatibility and has had
> no new work done on it in at least a year. See
> http://www.open-mpi.org/faq/?category=openfabrics#vapi-support.
>
> 4. Note that depending on your application (e.g., if it primarily
> sends large messages), it *may* be faster to use the DMA engine in
> your IB interface and not use Open MPI's shared memory interface.
> But there are a lot of factors involved here, such as the size of
> your typical messages, how many processes you run per host (i.e., I'm
> assuming you have one HCA that would need to service all the
> processes), etc.
>
>
> On Jun 10, 2007, at 6:04 PM, Ralph Castain wrote:
>
>
>> Hi Markus
>>
>> There are two MCA params that can help you, I believe:
>>
>> 1. You can set the maximum size of the shared memory file with
>>
>> -mca mpool_sm_max_size xxx
>>
>> where xxx is the maximum file size you want, expressed in bytes. The
>> default value I see is 512MBytes.
>>
>> 2. You can set the size per peer of the file, again in bytes:
>>
>> -mca mpool_sm_per_peer_size xxx
>>
>> This will allocate a file of size xxx * num_procs_on_the_node on each
>> node, up to the maximum file size (either the 512MB default or whatever
>> you specified using the previous param). This defaults to 32MBytes/proc.
>>
>>
>> I see that there is also a minimum (total, not per-proc) file size that
>> defaults to 128MBytes. If that is still too large, you can adjust it
>> using
>>
>> -mca mpool_sm_min_size yyy
>>
>>
>> Hope that helps
>> Ralph
>>
>>
>>
>> On 6/10/07 2:55 PM, "Markus Daene" <markus.daene_at_physik.uni-halle.de> wrote:
>>
>>
>>> Dear all,
>>>
>>> I hope I am in the correct mailing list with my problem.
>>> I try to run openmpi with the gridengine (6.0u10, 6.1). Therefore I
>>> compiled openmpi (1.2.2), which has the gridengine support included;
>>> I have checked this with ompi_info.
>>> In principle, openmpi runs well.
>>> The gridengine is configured such that the user has to specify the
>>> memory consumption via the h_vmem option. I noticed that with a larger
>>> number of processes the job is killed by the gridengine for taking too
>>> much memory.
>>> To take a closer look at this, I wrote a small and simple (Fortran)
>>> MPI program which has just an MPI_Init and a (static) array, in my
>>> case of 50MB; the program then goes into an (infinite) loop, because
>>> it takes some time until the gridengine reports the maxvmem.
>>> I found that if the processes all run on different nodes, there is
>>> only a fixed offset per process, i.e. linear scaling. But it becomes
>>> worse when the jobs run on one node: there the offset seems to scale
>>> quadratically, in my case about 30MB. I made a list of the virtual
>>> memory reported by the gridengine, running on a 16-processor node:
>>>
>>> #N proc virt. Mem[MB]
>>> 1 182
>>> 2 468
>>> 3 825
>>> 4 1065
>>> 5 1001
>>> 6 1378
>>> 7 1817
>>> 8 2303
>>> 12 4927
>>> 16 8559
>>>
>>> The pure program should need N*50MB; for 16 processes that is only
>>> 800MB, but it takes 10 times more, >7GB!!! Of course, the gridengine
>>> will kill the job if this overhead is not taken into account, because
>>> of too much virtual memory consumption. The memory consumption is not
>>> related to the grid engine; it is the same if I run from the command
>>> line.
>>> I guess it might be related to the 'sm' component of the btl.
>>> Is it possible to avoid the quadratic scaling?
>>> Of course I could use the mvapi or tcp components only, like
>>> mpirun --mca btl mvapi -np 16 ./my_test_program
>>> in this case the virtual memory is fine, but it is not what one wants
>>> on an SMP node.
>>>
>>>
>>> Then it becomes even worse:
>>> openmpi nicely reports the (max./actual) used virtual memory to the
>>> grid engine as the sum over all processes. This value is then compared
>>> with the one the user has specified with the h_vmem option, but the
>>> gridengine takes this value per process for the allocation of the job
>>> (which works) and does not multiply it by the number of processes.
>>> Maybe one should report this to the gridengine mailing list, but it
>>> could be related to the openmpi interface as well.
>>>
>>> The last thing I noticed:
>>> It seems that if the h_vmem option for gridengine jobs is specified as
>>> '2.0G', my test job was immediately killed; but when I specify '2000M'
>>> (which is obviously less) it works. The gridengine always puts the job
>>> on the correct node as requested, but I think there might be a problem
>>> in the openmpi interface.
>>>
>>>
>>> It would be nice if someone could give some hints on how to avoid the
>>> quadratic scaling, or maybe reconsider whether it is really necessary
>>> in openmpi.
>>>
>>>
>>> Thanks.
>>> Markus Daene
>>>
>>>
>>>
>>>
>>> my compiling options:
>>> ./configure --prefix=/not_important --enable-static
>>> --with-f90-size=medium --with-f90-max-array-dim=7
>>> --with-mpi-param-check=always --enable-cxx-exceptions --with-mvapi
>>> --enable-mca-no-build=btl-tcp
>>>
>>> ompi_info output:
>>> Open MPI: 1.2.2
>>> Open MPI SVN revision: r14613
>>> Open RTE: 1.2.2
>>> Open RTE SVN revision: r14613
>>> OPAL: 1.2.2
>>> OPAL SVN revision: r14613
>>> Prefix: /usrurz/openmpi/1.2.2/pathscale_3.0
>>> Configured architecture: x86_64-unknown-linux-gnu
>>> Configured by: root
>>> Configured on: Mon Jun 4 16:04:38 CEST 2007
>>> Configure host: GE1N01
>>> Built by: root
>>> Built on: Mon Jun 4 16:09:37 CEST 2007
>>> Built host: GE1N01
>>> C bindings: yes
>>> C++ bindings: yes
>>> Fortran77 bindings: yes (all)
>>> Fortran90 bindings: yes
>>> Fortran90 bindings size: small
>>> C compiler: pathcc
>>> C compiler absolute: /usrurz/pathscale/bin/pathcc
>>> C++ compiler: pathCC
>>> C++ compiler absolute: /usrurz/pathscale/bin/pathCC
>>> Fortran77 compiler: pathf90
>>> Fortran77 compiler abs: /usrurz/pathscale/bin/pathf90
>>> Fortran90 compiler: pathf90
>>> Fortran90 compiler abs: /usrurz/pathscale/bin/pathf90
>>> C profiling: yes
>>> C++ profiling: yes
>>> Fortran77 profiling: yes
>>> Fortran90 profiling: yes
>>> C++ exceptions: yes
>>> Thread support: posix (mpi: no, progress: no)
>>> Internal debug support: no
>>> MPI parameter check: always
>>> Memory profiling support: no
>>> Memory debugging support: no
>>> libltdl support: yes
>>> Heterogeneous support: yes
>>> mpirun default --prefix: no
>>> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA coll: self (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA io: romio (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.2)
>>> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.2)
>>> MCA btl: mvapi (MCA v1.0, API v1.0.1, Component v1.2.2)
>>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.2)
>>> MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.2)
>>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.2)
>>> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.2)
>>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.2)
>>> MCA sds: env (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.2)
>>> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.2)
>>>
>>> ----------------------------------------------------------
>>> Markus Daene
>>> Martin Luther University Halle-Wittenberg
>>> Naturwissenschaftliche Fakultaet II
>>> Institute of Physics
>>> Von Seckendorff-Platz 1 (room 1.28)
>>> 06120 Halle
>>> Germany
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
>