Vaz, Guilherme wrote:
> Thanks for your email. Some more explanation then:
> 1) We have made this estimation of memory already in the past.
> My code takes for n*Mcells => 2.5n*GBRam. So for 1.2MCells we need 3GB Ram.
> The problem occurs in one PC with 12GB Ram and 4 cores. So 12GB Ram is enough.
> So far (and in the other systems) if we had problems with memory it "just"
> starts to swap but did/does not crash.
Now you are speaking. Much better.
So, you know your problem size, you know how much memory you need,
at least w.r.t. what you allocate directly.
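For what it is worth, that back-of-the-envelope estimate can be written down as a small script. This is only a sketch: it assumes the 2.5 GB per million cells rule quoted above holds, and that memory splits evenly across MPI processes (ghost cells, solver workspace and library buffers are ignored). The function names are made up for illustration.

```python
# Rough memory estimate based on the rule of thumb quoted above:
# about 2.5 GB of RAM per million cells (an empirical figure for this code).
GB_PER_MCELL = 2.5


def total_memory_gb(mcells):
    """Total memory in GB for a grid of `mcells` million cells."""
    return GB_PER_MCELL * mcells


def memory_per_process_gb(mcells, nprocs):
    """Memory per MPI process, assuming an even split.

    Ignores ghost cells and per-process overhead, so the real
    figure will be somewhat higher.
    """
    return total_memory_gb(mcells) / nprocs


if __name__ == "__main__":
    mcells = 1.2  # the 1.2 million cell grid from this thread
    print(f"total: {total_memory_gb(mcells):.2f} GB")
    for nprocs in (1, 2, 4, 8, 16, 32):
        per_proc_mb = memory_per_process_gb(mcells, nprocs) * 1024
        print(f"-n {nprocs:2d}: {per_proc_mb:8.1f} MB per process")
```

Whatever the number of processes, the total stays near 3 GB under these assumptions, well below the 12 GB in the machine.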
> 2) The code is my code, so I am sure that with mpiexec
> or without mpiexec the code is the same and that I don't
> use OpenMP directly in the code.
I am a bit surprised that the same code runs with and without mpiexec.
Do you mean the same executable?
Or are they different executables, one
of which you perhaps compile with pre-processor directives to get around
the MPI calls and make it sequential?
As for OpenMP, there is still the possibility that the libraries you
call use threads (with or without OpenMP).
> But, we also use
> Intel MKL libraries together with PETSC linear-system solvers.
> I know that MKL tries to start several threads for each MPI process
> (yes process not processor). We disable it by setting MKL_NUM_THREADS=1
> (otherwise we see immediately in the task manager the several threads starting).
I would catch all the return codes from PETSc calls, print them out if
in error, and call MPI_Abort, if this is not yet in your code, and keep
that in place at least while you sort out where the problem is.
If using MKL directly, not via PETSc, do the same with the MKL calls.
> 3) All the runs are done in a 64bits Intel machine with 4 cores and 12GB Ram.
> We don't set any affinity or similar stuff.
I am surprised that it runs with -np 32 on only 4 physical cores,
which is a lot of oversubscription.
I wonder if this reduces walltime.
> 4) I could always start more MPI processes than cores,
> as long as the memory was enough. And the memory is enough,
> otherwise how can the same problem not work with 2,4,8,16 MPI processes
> and work with 32? So that is why I thought of a stack memory problem.
> 5) I will see what gdb says about a core-dump tomorrow.
> Gus, is this more clear?
> Do you have any tip now?
Old tip again:
Did you monitor memory use with top while the job is running?
"top -H" shows you all threads.
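If watching top by hand is awkward, the same numbers can be read from /proc on Linux. A minimal sketch, assuming a Linux /proc filesystem; the helper name parse_status is hypothetical, and the PID has to be supplied by you (for instance from pgrep).

```python
# Read resident memory and thread count of a process from /proc/<pid>/status.
# Linux only; VmRSS is reported in kB, Threads is the number of threads.
import sys


def parse_status(text):
    """Extract VmRSS (kB) and Threads from the text of /proc/<pid>/status."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "VmRSS":
            info["rss_kb"] = int(value.split()[0])
        elif key == "Threads":
            info["threads"] = int(value.strip())
    return info


if __name__ == "__main__":
    # Use the PID given on the command line, or this process itself.
    has_pid = len(sys.argv) > 1 and sys.argv[1].isdigit()
    pid = sys.argv[1] if has_pid else "self"
    try:
        with open(f"/proc/{pid}/status") as f:
            print(parse_status(f.read()))
    except OSError as exc:
        print(f"could not read /proc/{pid}/status: {exc}")
```

Running it in a loop against each MPI process while the job is going should show whether the resident size or the thread count jumps just before the crash.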
> Don't you think
> this is a stack-memory problem, which btw is already ulimit -s unlimited?
That certainly helps for number crunching,
although it may not solve your specific problem.
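One thing worth checking is which stack limit the MPI processes really see, since a limit set in .bashrc does not necessarily reach processes started some other way. A small sketch using Python's standard resource module (Unix only); the script name check_stack.py is just an example.

```python
# Print the stack size limit seen by this process (Unix only).
import resource


def describe_limit(value):
    """Format an rlimit value in kB; RLIM_INFINITY means 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print(f"stack soft limit: {describe_limit(soft)}")
    print(f"stack hard limit: {describe_limit(hard)}")
```

Launching it the same way as the application, e.g. mpiexec -n 4 python check_stack.py, prints one pair of lines per process, which should show whether they all inherit the unlimited setting.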
> Thanks guys.
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
> Sent: Thursday, December 16, 2010 5:55 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] segmentation fault
> Vaz, Guilherme wrote:
>> Ok, ok. It is indeed a CFD program, and Gus got it right. Number of cells per core means memory per core (sorry for the inaccuracy).
>> My PC has 12GB of RAM.
> Can you do one of those typical engineering calculations, a back of the
> envelope estimate of how much memory your program needs for a certain
> problem size?
> This is the first thing to do.
> It should tell you whether 12GB is good enough or not.
> How many cells, how much memory each cell or array or structure takes,
> etc ...
>> And the same calculation runs fine in an old Ubuntu8.04 32bits with 4GB RAM.
>> What I find strange is that the same problem runs with 1 core (without invoking mpiexec)
> This one is likely to be a totally different version of the code,
> either serial or threaded (perhaps with OpenMP, NOT OpenMPI).
>> and then for large number of cores/processes, for instance mpiexec -n 32.
>> Something in between not.
> You didn't explain.
> Are all the runs (1 processor, 4 processors, 32 processors)
> in a single machine, or in a cluster?
> How many computers are used on each run?
> How much memory does each machine have?
> Any error messages?
> It makes a difference to understand what is going on.
> You may saturate memory in a single machine (your 4-processor run),
> but not on, say, four machines (if this is what you mean when you
> say it runs on 32 processors).
> Please, clarify.
> With the current problem description, a solution may not exist,
> or there may be multiple solutions for multiple and
> yet not described issues, or the solution may have nothing to do
> with the description you provided or with MPI.
> A mathematician would call this an "ill posed problem",
> a la Hadamard. :)
> But that is how debugging parallel programs goes.
>> And it is not a bug in the program because it runs in other machines
>> and the code has not been changed.
> That is no guarantee against bugs.
> They can creep in depending on the computer environment,
> how many computers you are using, the number of processors,
> on any data or parameter that you change,
> on a bunch of different things.
>> Anymore hints?
> Did you try the ones I sent before, regarding stack size,
> and monitoring memory via "top"?
> What did you get?
>> Thanks in advance.
>> dr. ir. Guilherme Vaz
>> CFD Researcher
>> Research & Development
>> E mailto:G.Vaz_at_[hidden]
>> T +31 317 49 33 25
>> 2, Haagsteeg, P.O. Box 28, 6700 AA Wageningen, The Netherlands
>> T +31 317 49 39 11, F +31 317 49 32 45, I www.marin.nl
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
>> Sent: Thursday, December 16, 2010 12:46 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] segmentation fault
>> Maybe CFD jargon?
>> Perhaps the number (not size) of cells in a mesh/grid being handled
>> by each core/cpu?
>> Ralph Castain wrote:
>>> I have no idea what you mean by "cell sizes per core". Certainly not any
>>> terminology within OMPI...
>>> On Dec 15, 2010, at 3:47 PM, Vaz, Guilherme wrote:
>>>> Dear all,
>>>> I have a problem with openmpi1.3, ifort+mkl v11.1 in Ubuntu10.04
>>>> systems (32 or 64bit). My code worked in Ubuntu8.04 and works in
>>>> RedHat based systems, with slightly different version changes on mkl
>>>> and ifort. There were no changes in the source code.
>>>> The problem is that the application works for small cell sizes per
>>>> core, but not for large cell sizes per core. And it always works for 1
>>>> Example: a grid with 1.2Million cells does not work with mpiexec -n 4
>>>> <my_app> but it works with mpiexec -n 32 <my_app>. It seems that there
>>>> is a maximum of cell/core. And it works with <my_app>.
>>>> Is this a stack size (or any memory problem)? Should I set the ulimit
>>>> -s unlimited not only on my bashrc but also in the ssh environment
>>>> (and how)? Or is something else?
>>>> Any clues/tips?
>>>> Thanks for any help.
>>>> dr. ir. Guilherme Vaz
>>>> CFD Researcher
>>>> Research & Development
>>>> E G.Vaz_at_[hidden] <mailto:G.Vaz_at_[hidden]>
>>>> T +31 317 49 33 25
>>>> 2, Haagsteeg, P.O. Box 28, 6700 AA Wageningen, The Netherlands
>>>> T +31 317 49 39 11, F +31 317 49 32 45, I www.marin.nl <http://www.marin.nl>
>>>> users mailing list
>>>> users_at_[hidden] <mailto:users_at_[hidden]>