Thanks for your email. Some more explanation then:
1) We have already made this memory estimate in the past. My code needs about 2.5 GB of RAM per million cells, so for 1.2 million cells we need about 3 GB. The problem occurs on a PC with 12 GB of RAM and 4 cores, so 12 GB is enough. So far (also on the other systems), when we had problems with memory the machine "just" started to swap, but it did not crash.
2) The code is my own, so I am sure that it is the same code with or without mpiexec, and that I don't use OpenMP directly in it. But we also use the Intel MKL libraries together with the PETSc linear-system solvers. I know that MKL tries to start several threads for each MPI process (yes, process, not processor). We disable this by setting MKL_NUM_THREADS=1 (otherwise we immediately see the extra threads starting in the task manager).
3) All the runs are done on a 64-bit Intel machine with 4 cores and 12 GB of RAM. We don't set any affinity or anything similar.
4) I could always start more MPI processes than cores, as long as there was enough memory. And there is enough memory; otherwise, how could the same problem fail with 2, 4, 8 or 16 MPI processes and work with 32? That is why I thought of a stack memory problem.
5) I will see what gdb says about a core-dump tomorrow.
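The estimate in point 1 can be written out as a quick calculation. This is only a sketch: the 2.5 GB per million cells comes from the rule of thumb above, while the even split over processes (no halo or solver overhead) is an assumption, and the function names are made up for illustration.

```python
# Rule of thumb from point 1: about 2.5 GB of RAM per million cells.
GB_PER_MCELL = 2.5


def total_memory_gb(mcells, gb_per_mcell=GB_PER_MCELL):
    """Estimated total memory (GB) for a grid of `mcells` million cells."""
    return mcells * gb_per_mcell


def memory_per_process_gb(mcells, nprocs, gb_per_mcell=GB_PER_MCELL):
    """Rough per-process share, ASSUMING an even split and no overhead."""
    return total_memory_gb(mcells, gb_per_mcell) / nprocs


if __name__ == "__main__":
    # The 1.2 million cell grid, for the process counts mentioned in the thread.
    for nprocs in (1, 4, 32):
        share = memory_per_process_gb(1.2, nprocs)
        print(f"{nprocs:2d} processes: about {share:.2f} GB each, "
              f"{total_memory_gb(1.2):.1f} GB in total")
```

Under these assumptions the total stays the same for every process count and only the share per process shrinks, which is why a per-process limit looks more suspicious to me than the total RAM.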
Gus, is this clearer? Do you have any tips now? Don't you think this is a stack-memory problem? The stack size, btw, is already set with ulimit -s unlimited.
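One way to check which stack limit the processes really get is to print it from inside a process started the same way as the solver. A minimal sketch, assuming Python 3 is available on the machine; the script name check_stack.py and the helper names are only illustrative.

```python
import resource


def describe_limit(value):
    """Format an rlimit value given in bytes as kB, or as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def stack_limits():
    """Return the (soft, hard) stack limits of the current process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}, hard limit: {hard}")
```

Started as "mpiexec -n 4 python3 check_stack.py", each process prints the limits it inherited. These may differ from what an interactive shell reports if the ulimit is only set in the bashrc.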
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
Sent: Thursday, December 16, 2010 5:55 PM
To: Open MPI Users
Subject: Re: [OMPI users] segmentation fault
Vaz, Guilherme wrote:
> Ok, ok. It is indeed a CFD program, and Gus got it right. Number of cells per core means memory per core (sorry for the inaccuracy).
> My PC has 12GB of RAM.
Can you do one of those typical engineering calculations, a back of the
envelope estimate of how much memory your program needs for a certain
This is the first thing to do.
It should tell you whether 12GB is good enough or not.
How many cells, how much memory each cell or array or structure takes,
> And the same calculation runs fine on an old 32-bit Ubuntu 8.04 with 4GB RAM.
> What I find strange is that the same problem runs with 1 core (without invoking mpiexec)
This one is likely to be a totally different version of the code,
either serial or threaded (perhaps with OpenMP, NOT OpenMPI).
> and then for a large number of cores/processes, for instance mpiexec -n 32.
> Anything in between does not.
You didn't explain.
Are all the runs (1 processor, 4 processors, 32 processors)
in a single machine, or in a cluster?
How many computers are used on each run?
How much memory does each machine have?
Any error messages?
It makes a difference to understand what is going on.
You may saturate memory in a single machine (your 4-processor run),
but not on, say, four machines (if this is what you mean when you
say it runs on 32 processors).
With the current problem description, a solution may not exist,
or there may be multiple solutions for multiple, as yet
undescribed issues, or the solution may have nothing to do
with the description you provided or with MPI.
A mathematician would call this an "ill-posed problem",
a la Hadamard. :)
But that is how debugging parallel programs goes.
> And it is not a bug in the program because it runs on other machines
> and the code has not been changed.
That is no guarantee against bugs.
They can creep in depending on the computer environment,
how many computers you are using, the number of processors,
on any data or parameter that you change,
on a bunch of different things.
> Any more hints?
Did you try the ones I sent before, regarding stack size,
and monitoring memory via "top"?
What did you get?
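The memory check can also be scripted instead of watched by hand. This is only a sketch, assuming a Linux system where /proc/meminfo is readable; the function names are invented for illustration and nothing here is specific to Open MPI.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. huge page counts) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(info):
    """Swap currently in use, in kB."""
    return info["SwapTotal"] - info["SwapFree"]


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print(f"free RAM: {info['MemFree']} kB, "
          f"swap in use: {swap_used_kb(info)} kB")
```

Running it a few times while the job is active should show whether free RAM is shrinking and swap is growing before the crash.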
> Thanks in advance.
> dr. ir. Guilherme Vaz
> CFD Researcher
> Research & Development
> E mailto:G.Vaz_at_[hidden]
> T +31 317 49 33 25
> 2, Haagsteeg, P.O. Box 28, 6700 AA Wageningen, The Netherlands
> T +31 317 49 39 11, F +31 317 49 32 45, I www.marin.nl
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
> Sent: Thursday, December 16, 2010 12:46 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] segmentation fault
> Maybe CFD jargon?
> Perhaps the number (not size) of cells in a mesh/grid being handled
> by each core/cpu?
> Ralph Castain wrote:
>> I have no idea what you mean by "cell sizes per core". Certainly not any
>> terminology within OMPI...
>> On Dec 15, 2010, at 3:47 PM, Vaz, Guilherme wrote:
>>> Dear all,
>>> I have a problem with openmpi1.3, ifort+mkl v11.1 in Ubuntu10.04
>>> systems (32 or 64bit). My code worked in Ubuntu8.04 and works in
>>> RedHat based systems, with slightly different versions of mkl
>>> and ifort. There were no changes in the source code.
>>> The problem is that the application works for small cell sizes per
>>> core, but not for large cell sizes per core. And it always works for 1
>>> Example: a grid with 1.2Million cells does not work with mpiexec -n 4
>>> <my_app> but it works with mpiexec -n 32 <my_app>. It seems that there
>>> is a maximum of cell/core. And it works with <my_app>.
>>> Is this a stack size (or any other memory) problem? Should I set
>>> ulimit -s unlimited not only in my bashrc but also in the ssh
>>> environment (and how)? Or is it something else?
>>> Any clues/tips?
>>> Thanks for any help.
>>> dr. ir. Guilherme Vaz
>>> CFD Researcher
>>> Research & Development
>>> E G.Vaz_at_[hidden] <mailto:G.Vaz_at_[hidden]>
>>> T +31 317 49 33 25
>>> 2, Haagsteeg, P.O. Box 28, 6700 AA Wageningen, The Netherlands
>>> T +31 317 49 39 11, F +31 317 49 32 45, I www.marin.nl <http://www.marin.nl>
>>> users mailing list
>>> users_at_[hidden] <mailto:users_at_[hidden]>