Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Where is the error? (MPI program in fortran)
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-04-17 14:15:06


Sounds like you're freeing memory that does not belong to you, or you have some other kind of memory corruption.
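When free() aborts like this, a common cause is writing past the end of an allocatable array earlier in the run: the allocator's bookkeeping gets corrupted and the damage only surfaces later, at the deallocate. A minimal Fortran sketch (the array name zv is borrowed from the backtrace; everything else is illustrative) of a defensive deallocate that at least reports a failure instead of crashing:

```fortran
program guarded_dealloc
  implicit none
  real, allocatable :: zv(:)
  integer :: ierr

  allocate(zv(100))
  zv = 0.0
  ! Heap corruption usually comes from an out-of-bounds write; a
  ! guard does not fix that, but allocated() plus stat= turns a
  ! double-deallocate or a deallocate of an unallocated array into
  ! a diagnosable error rather than a segfault inside libc's free().
  if (allocated(zv)) then
     deallocate(zv, stat=ierr)
     if (ierr /= 0) print *, 'deallocate failed, stat=', ierr
  end if
end program guarded_dealloc
```

Compiling with bounds checking (e.g. gfortran's -fcheck=bounds) is usually the fastest way to find the out-of-bounds write itself.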

On Apr 17, 2014, at 2:01 PM, Oscar Mojica <o_mojical_at_[hidden]> wrote:

> Hello guys
>
> I used the command
>
> ulimit -s unlimited
>
> and got
>
> stack size (kbytes, -s) unlimited
>
> but when I ran the program I got the same error. So I used the gdb debugger. I compiled using
>
> mpif90 -g -o exe mpivfsa_versao2.f
>
> I ran the program and then I ran gdb with both the executable and the core file name as arguments and got the following
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002aaaab59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) backtrace
> #0 0x00002aaaab59b54c in free () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x0000000000406801 in inv_grav3d_vfsa () at mpivfsa_versao2.f:131
> #2 0x0000000000406b88 in main (argc=1, argv=0x7fffffffe387) at mpivfsa_versao2.f:9
> #3 0x00002aaaab53976d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
> #4 0x0000000000401399 in _start ()
>
> These are the lines
>
> 9 use mpi
> 131 deallocate(zv,xrec,yrec,xprm,yprm)
>
> I think the problem is not memory; the problem is related to MPI.
>
> What could the error be?
>
> Oscar Fabian Mojica Ladino
> Geologist M.S. in Geophysics
>
>
> > From: o_mojical_at_[hidden]
> > Date: Wed, 16 Apr 2014 15:17:51 -0300
> > To: users_at_[hidden]
> > Subject: Re: [OMPI users] Where is the error? (MPI program in fortran)
> >
> > Gus
> > It is a single machine and i have installed Ubuntu 12.04 LTS. I left my computer in the college but I will try to follow your advice when I can and tell you about it.
> >
> > Thanks
> >
> > Sent from my iPad
> >
> > > On 16/04/2014, at 14:17, "Gus Correa" <gus_at_[hidden]> wrote:
> > >
> > > Hi Oscar
> > >
> > > This is a long shot, but maybe worth trying.
> > > I am assuming you're using Linux, or some form of Unix, right?
> > >
> > > You may try to increase the stack size.
> > > The default in Linux is often too small for large programs.
> > > Sometimes this may cause a segmentation fault, even if the
> > > program is correct.
> > >
> > > You can check what you have with:
> > >
> > > ulimit -a (bash)
> > >
> > > or
> > >
> > > limit (csh or tcsh)
> > >
> > > Then set it to a larger number or perhaps to unlimited,
> > > e.g.:
> > >
> > > ulimit -s unlimited
> > >
> > > or
> > >
> > > limit stacksize unlimited
> > >
> > > You didn't say anything about the computer(s) you are using.
> > > Is this a single machine, a cluster, something else?
> > >
> > > Anyway, resetting the stack size may depend a bit on what you
> > > have in /etc/security/limits.conf,
> > > and whether it allows you to increase the stack size.
> > > If it is a single computer to which you have root access, you may
> > > do it yourself.
> > > There are other limits worth increasing (number of open files,
> > > max locked memory).
> > > For instance, this could go in limits.conf:
> > >
> > > * - memlock -1
> > > * - stack -1
> > > * - nofile 4096
> > >
> > > See 'man limits.conf' for details.
> > >
> > > If it is a cluster, this should be set on all nodes,
> > > and you may need to ask your system administrator to do it.
> > >
> > > I hope this helps,
> > > Gus Correa
> > >
> > >> On 04/16/2014 11:24 AM, Gus Correa wrote:
> > >>> On 04/16/2014 08:30 AM, Oscar Mojica wrote:
> > >>> How would be the command line to compile with the option -g ? What
> > >>> debugger can I use?
> > >>> Thanks
> > >>>
> > >>
> > >> Replace any optimization flags (-O2 or similar) with -g.
> > >> Check if your compiler has the -traceback flag or similar
> > >> (see 'man compiler-name').
> > >>
> > >> The gdb debugger is normally available on Linux (or you can install it
> > >> with yum, apt-get, etc). An alternative is ddd, with a GUI (can also be
> > >> installed from yum, etc).
> > >> If you use a commercial compiler you may have a debugger with a GUI.
> > >>
> > >>> Sent from my iPad
> > >>>
> > >>>> On 15/04/2014, at 18:20, "Gus Correa" <gus_at_[hidden]>
> > >>>> wrote:
> > >>>>
> > >>>> Or just compiling with -g or -traceback (depending on the compiler) will
> > >>>> give you more information about the point of failure
> > >>>> in the error message.
> > >>>>
> > >>>>> On 04/15/2014 04:25 PM, Ralph Castain wrote:
> > >>>>> Have you tried using a debugger to look at the resulting core file? It
> > >>>>> will probably point you right at the problem. Most likely a case of
> > >>>>> overrunning some array when #temps > 5
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Apr 15, 2014 at 10:46 AM, Oscar Mojica <o_mojical_at_[hidden]
> > >>>>> <mailto:o_mojical_at_[hidden]>> wrote:
> > >>>>>
> > >>>>> Hello everybody
> > >>>>>
> > >>>>> I implemented a parallel simulated annealing algorithm in fortran.
> > >>>>> The algorithm is described as follows:
> > >>>>>
> > >>>>> 1. The MPI program initially generates P processes with ranks
> > >>>>> 0,1,...,P-1.
> > >>>>> 2. The MPI program generates a starting point, sends it to all
> > >>>>> processes, and sets T=T0.
> > >>>>> 3. At the current temperature T, each process executes its
> > >>>>> iterative operations.
> > >>>>> 4. At the end of the iterations, the process with rank 0 collects
> > >>>>> the solutions obtained by each process at the current temperature.
> > >>>>> 5. The best of these solutions is broadcast among all
> > >>>>> participating processes.
> > >>>>> 6. Each process cools the temperature and goes back to step 3,
> > >>>>> until the maximum number of temperatures is reached.
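The loop described in steps 2-6 above can be sketched in MPI Fortran roughly as follows. This is an illustrative skeleton, not the attached program: the placeholder objective and all variable names are hypothetical, and the model-vector broadcast of step 5 is only indicated in a comment.

```fortran
program sa_skeleton
  use mpi
  implicit none
  integer :: rank, nprocs, ierr, it, ntemp, bestrank
  double precision :: T, cost
  double precision :: pair_in(2), pair_out(2)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  T = 1.0d0                      ! step 2: T = T0
  ntemp = 15
  do it = 1, ntemp               ! steps 3-6
     ! step 3: SA moves at temperature T (placeholder objective here)
     cost = abs(dble(rank) - 1.5d0) * T

     ! steps 4-5: find the globally best solution and its owner.
     ! MPI_MINLOC on a (value, rank) pair does the collection and
     ! comparison in one collective call, with no special role for
     ! rank 0.
     pair_in(1) = cost
     pair_in(2) = dble(rank)
     call MPI_Allreduce(pair_in, pair_out, 1, MPI_2DOUBLE_PRECISION, &
                        MPI_MINLOC, MPI_COMM_WORLD, ierr)
     bestrank = int(pair_out(2))
     ! the owner would then MPI_Bcast its model vector with root=bestrank

     T = 0.9d0 * T               ! step 6: cool and repeat
  end do

  call MPI_Finalize(ierr)
end program sa_skeleton
```

Build and run as in the thread: mpif90 -o exe sa_skeleton.f90 && mpirun -np 4 ./exe.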
> > >>>>>
> > >>>>> I compiled with: mpif90 -o exe mpivfsa_version2.f
> > >>>>> and run with: mpirun -np 4 ./exe in a single machine
> > >>>>>
> > >>>>> So I have 4 processes, 1 iteration per temperature and, for
> > >>>>> example, 15 temperatures. When I run the program
> > >>>>> with just 5 temperatures it works well, but when the number of
> > >>>>> temperatures is higher than 5 it doesn't write the
> > >>>>> output files and I get the following error message:
> > >>>>>
> > >>>>>
> > >>>>> [oscar-Vostro-3550:06740] *** Process received signal ***
> > >>>>> [oscar-Vostro-3550:06741] *** Process received signal ***
> > >>>>> [oscar-Vostro-3550:06741] Signal: Segmentation fault (11)
> > >>>>> [oscar-Vostro-3550:06741] Signal code: Address not mapped (1)
> > >>>>> [oscar-Vostro-3550:06741] Failing at address: 0xad6af
> > >>>>> [oscar-Vostro-3550:06742] *** Process received signal ***
> > >>>>> [oscar-Vostro-3550:06740] Signal: Segmentation fault (11)
> > >>>>> [oscar-Vostro-3550:06740] Signal code: Address not mapped (1)
> > >>>>> [oscar-Vostro-3550:06740] Failing at address: 0xad6af
> > >>>>> [oscar-Vostro-3550:06742] Signal: Segmentation fault (11)
> > >>>>> [oscar-Vostro-3550:06742] Signal code: Address not mapped (1)
> > >>>>> [oscar-Vostro-3550:06742] Failing at address: 0xad6af
> > >>>>> [oscar-Vostro-3550:06740] [ 0]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f49ee2224a0]
> > >>>>> [oscar-Vostro-3550:06740] [ 1]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(cfree+0x1c) [0x7f49ee26f54c]
> > >>>>> [oscar-Vostro-3550:06740] [ 2] ./exe() [0x406742]
> > >>>>> [oscar-Vostro-3550:06740] [ 3] ./exe(main+0x34) [0x406ac9]
> > >>>>> [oscar-Vostro-3550:06740] [ 4]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
> > >>>>> [0x7f49ee20d76d]
> > >>>>> [oscar-Vostro-3550:06742] [ 0]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f6877fdc4a0]
> > >>>>> [oscar-Vostro-3550:06742] [ 1]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(cfree+0x1c) [0x7f687802954c]
> > >>>>> [oscar-Vostro-3550:06742] [ 2] ./exe() [0x406742]
> > >>>>> [oscar-Vostro-3550:06742] [ 3] ./exe(main+0x34) [0x406ac9]
> > >>>>> [oscar-Vostro-3550:06742] [ 4]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
> > >>>>> [0x7f6877fc776d]
> > >>>>> [oscar-Vostro-3550:06742] [ 5] ./exe() [0x401399]
> > >>>>> [oscar-Vostro-3550:06742] *** End of error message ***
> > >>>>> [oscar-Vostro-3550:06740] [ 5] ./exe() [0x401399]
> > >>>>> [oscar-Vostro-3550:06740] *** End of error message ***
> > >>>>> [oscar-Vostro-3550:06741] [ 0]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7fa6c4c6e4a0]
> > >>>>> [oscar-Vostro-3550:06741] [ 1]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(cfree+0x1c) [0x7fa6c4cbb54c]
> > >>>>> [oscar-Vostro-3550:06741] [ 2] ./exe() [0x406742]
> > >>>>> [oscar-Vostro-3550:06741] [ 3] ./exe(main+0x34) [0x406ac9]
> > >>>>> [oscar-Vostro-3550:06741] [ 4]
> > >>>>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
> > >>>>> [0x7fa6c4c5976d]
> > >>>>> [oscar-Vostro-3550:06741] [ 5] ./exe() [0x401399]
> > >>>>> [oscar-Vostro-3550:06741] *** End of error message ***
> > >>>>>
> > >>>>> --------------------------------------------------------------------------
> > >>>>>
> > >>>>> mpirun noticed that process rank 0 with PID 6917 on node
> > >>>>> oscar-Vostro-3550 exited on signal 11 (Segmentation fault).
> > >>>>>
> > >>>>> --------------------------------------------------------------------------
> > >>>>>
> > >>>>> 2 total processes killed (some possibly by mpirun during cleanup)
> > >>>>>
> > >>>>> If there were a segmentation fault, it should not work in any
> > >>>>> case. I checked the program and didn't find the error. Why does
> > >>>>> the program work with five temperatures?
> > >>>>> Could someone help me find the error and answer my question,
> > >>>>> please?
> > >>>>>
> > >>>>> The program and the necessary files to run it are attached
> > >>>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>>
> > >>>>> _Oscar Fabian Mojica Ladino_
> > >>>>> Geologist M.S. in Geophysics
> > >>>>>
> > >>>>> _______________________________________________
> > >>>>> users mailing list
> > >>>>> users_at_[hidden] <mailto:users_at_[hidden]>
> > >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/