Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segmentation fault
From: Gus Correa (gus_at_[hidden])
Date: 2010-12-16 11:55:15


Vaz, Guilherme wrote:
> Ok, ok. It is indeed a CFD program, and Gus got it right. Number of cells per core means memory per core (sorry for the inaccuracy).
> My PC has 12GB of RAM.

Can you do one of those typical engineering calculations, a back of the
envelope estimate of how much memory your program needs for a certain
problem size?
This is the first thing to do.
It should tell you whether 12GB is good enough or not.
How many cells, how much memory each cell or array or structure takes,
etc ...
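
A back-of-the-envelope estimate of this kind could be sketched as below.
The 1.2 million cells come from the original post; the number of values
stored per cell (200), the 8-byte reals, and the 1.5 overhead factor are
pure assumptions for illustration and must be replaced with the numbers
from your own code.

```python
def estimate_memory_gb(n_cells, values_per_cell, bytes_per_value=8, overhead=1.5):
    """Rough total memory in GB for a cell-based solver.

    values_per_cell, bytes_per_value and overhead are guesses that the
    user must replace with figures taken from the actual program.
    """
    total_bytes = n_cells * values_per_cell * bytes_per_value * overhead
    return total_bytes / 1024**3


def per_process_gb(total_gb, n_procs):
    """Even split across processes; ignores halo/ghost cells and buffers."""
    return total_gb / n_procs


if __name__ == "__main__":
    n_cells = 1_200_000      # grid size quoted in the original post
    values_per_cell = 200    # ASSUMPTION: unknowns + coefficients + work arrays
    total = estimate_memory_gb(n_cells, values_per_cell)
    print(f"estimated total: {total:.2f} GB")
    for n_procs in (1, 4, 32):
        share = per_process_gb(total, n_procs)
        print(f"  {n_procs:2d} processes: about {share:.2f} GB each")
```

The even split is optimistic: a domain-decomposed code usually also
stores ghost cells and communication buffers, so the real per-process
figure is likely somewhat higher.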

> And the same calculation runs fine in an old Ubuntu8.04 32bits with 4GB RAM.
> What I find strange is that the same problem runs with 1 core (without invoking mpiexec)

This one is likely to be a totally different version of the code,
either serial or threaded (perhaps with OpenMP, NOT OpenMPI).

> and then for large number of cores/processes, for instance mpiexec -n 32.
> Something in between not.

You didn't explain.
Are all the runs (1 processor, 4 processors, 32 processors)
in a single machine, or in a cluster?
How many computers are used on each run?
How much memory does each machine have?
Any error messages?

It makes a difference to understand what is going on.
You may saturate memory in a single machine (your 4-processor run),
but not on, say, four machines (if this is what you mean when you
say it runs on 32 processors).
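
To make the single-machine versus cluster point concrete, here is a
small sketch that checks whether the busiest machine would run out of
RAM. The 12 GB per machine is from the thread; the 16 GB total footprint
and the four-machine layout are hypothetical numbers chosen only to
illustrate the effect, and the processes are assumed to be spread evenly.

```python
import math


def ranks_per_machine(n_procs, n_machines):
    """Processes on the busiest machine, assuming an even spread."""
    return math.ceil(n_procs / n_machines)


def fits_in_ram(total_gb, n_procs, n_machines, ram_per_machine_gb):
    """Return (fits, load_gb) for the busiest machine.

    Assumes the total footprint divides evenly among the processes,
    which ignores ghost cells and per-process overhead.
    """
    per_rank_gb = total_gb / n_procs
    busiest_gb = per_rank_gb * ranks_per_machine(n_procs, n_machines)
    return busiest_gb <= ram_per_machine_gb, busiest_gb


if __name__ == "__main__":
    total_gb = 16.0   # HYPOTHETICAL total footprint of the job
    ram_gb = 12.0     # RAM per machine, as reported in the thread
    for n_procs, n_machines in ((4, 1), (32, 4)):
        ok, load = fits_in_ram(total_gb, n_procs, n_machines, ram_gb)
        status = "fits" if ok else "does NOT fit"
        print(f"{n_procs} procs on {n_machines} machine(s): "
              f"{load:.1f} GB on the busiest machine, {status}")
```

With these made-up numbers the 4-process run on one machine exceeds
12 GB, while 32 processes on four machines stay well below it.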

Please, clarify.
With the current problem description, a solution may not exist,
or there may be multiple solutions for multiple and
yet not described issues, or the solution may have nothing to do
with the description you provided or with MPI.
A mathematician would call this an "ill-posed problem",
a la Hadamard. :)
But that is how debugging parallel programs goes.

> And it is not a bug in the program because it runs in other machines
> and the code has not been changed.
>

That is no guarantee against bugs.
They can creep in depending on the computer environment,
how many computers you are using, the number of processors,
on any data or parameter that you change,
on a bunch of different things.

> Any more hints?
>

Did you try the ones I sent before, regarding stack size,
and monitoring memory via "top"?
What did you get?
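
One possible way to see which stack limit a process really gets is the
small sketch below, which uses Python's standard `resource` module
(Unix only). The helper names are made up for this example.

```python
import resource


def describe_limit(value):
    """Format an rlimit value in kB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def stack_limits():
    """Return the (soft, hard) stack size limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}, hard limit: {hard}")
```

Saved under a hypothetical name such as check_stack.py and launched
through mpiexec (for instance "mpiexec -n 4 python3 check_stack.py"),
it should report the limit seen by the launched processes, which may
differ from the one in your interactive shell.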

Gus

> Thanks in advance.
>
> Guilherme
>
>
>
>
> dr. ir. Guilherme Vaz
> CFD Researcher
> Research & Development
> E mailto:G.Vaz_at_[hidden]
> T +31 317 49 33 25
>
> MARIN
> 2, Haagsteeg, P.O. Box 28, 6700 AA Wageningen, The Netherlands
> T +31 317 49 39 11, F +31 317 49 32 45, I www.marin.nl
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
> Sent: Thursday, December 16, 2010 12:46 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] segmentation fault
>
> Maybe CFD jargon?
> Perhaps the number (not size) of cells in a mesh/grid being handled
> by each core/cpu?
>
> Ralph Castain wrote:
>> I have no idea what you mean by "cell sizes per core". Certainly not any
>> terminology within OMPI...
>>
>>
>> On Dec 15, 2010, at 3:47 PM, Vaz, Guilherme wrote:
>>
>>> Dear all,
>>>
>>> I have a problem with openmpi1.3, ifort+mkl v11.1 in Ubuntu10.04
>>> systems (32 or 64bit). My code worked in Ubuntu8.04 and works in
>>> RedHat based systems, with slightly different version changes on mkl
>>> and ifort. There were no changes in the source code.
>>> The problem is that the application works for small cell sizes per
>>> core, but not for large cell sizes per core. And it always works for 1
>>> core.
>>> Example: a grid with 1.2Million cells does not work with mpiexec -n 4
>>> <my_app> but it works with mpiexec -n 32 <my_app>. It seems that there
>>> is a maximum of cell/core. And it works with <my_app>.
>>>
>>> Is this a stack size (or any memory problem)? Should I set the ulimit
>>> -s unlimited not only on my bashrc but also in the ssh environment
>>> (and how)? Or is something else?
>>> Any clues/tips?
>>>
>>> Thanks for any help.
>>>
>>> Gui
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users