
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] SIGSEGV in OMPI 1.6.x
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-06 08:01:01


If you run into a segv in this code, it almost certainly means that you have heap corruption somewhere. FWIW, that has *always* been what it meant when I've run into segvs in any code under opal/mca/memory/linux/. Meaning: my user code did something wrong and corrupted the heap, and then some later malloc() or free() caused a segv in this area of the code.
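
One hint that the heap was overwritten is visible in the gdb output quoted below: the `mem` value passed to `opal_memory_ptmalloc2_int_free` does not look like an address at all. The sketch below decodes such a value into the bytes it would occupy in memory. It assumes a little-endian x86-64 host, which the addresses in the backtrace suggest but the thread does not state. `decode_le_ascii` is a hypothetical helper written for this illustration, not part of Open MPI or glibc.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Hypothetical helper: write the 8 bytes of `value`, in the order they
 * would sit in memory on a little-endian machine, into `out` as text.
 * Non-printable bytes are shown as '.'. `out` must hold 9 chars.
 */
static void decode_le_ascii(uint64_t value, char out[9])
{
    for (int i = 0; i < 8; i++) {
        /* Byte i in memory is bits [8*i, 8*i+7] of the value. */
        unsigned char byte = (unsigned char)((value >> (8 * i)) & 0xff);
        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[8] = '\0';
}
```

Calling `decode_le_ascii(UINT64_C(0x203a746f74512000), buf)` with the `mem` value from frame #0 yields `. Qtot: `, and the `name` value from frame #4, `0x6f6e6f69006f6e78`, yields `xno.iono`. Both read like fragments of character data, which fits the theory that string data was written over memory the allocator later tried to interpret.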

This code is the same ptmalloc memory allocator that has shipped in glibc for years. I'd be hard-pressed to say that any code is 100% bug free :-), but I'd be surprised if there is a bug in this particular chunk of code.

I'd run your code through valgrind or some other memory-checking debugger and see if that can shed any light on what's going on.
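
To illustrate why the crash shows up at free() time rather than at the bad write, here is a minimal sketch of the guard-byte idea that memory checkers build on. It is not how valgrind or ptmalloc is implemented. `guarded_alloc`, `guarded_free`, `GUARD_SIZE`, and `GUARD_BYTE` are all hypothetical names chosen for this example, and the returned pointer is only suitable for byte data since it is offset by `sizeof(size_t)` from the malloc() result.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 16    /* bytes of padding placed after the user area */
#define GUARD_BYTE 0xAB  /* known pattern written into the padding */

/* Allocate `size` usable bytes, preceded by the stored size and
 * followed by a guard zone filled with GUARD_BYTE. */
static void *guarded_alloc(size_t size)
{
    unsigned char *block = malloc(sizeof(size_t) + size + GUARD_SIZE);
    if (block == NULL)
        return NULL;
    memcpy(block, &size, sizeof size);  /* remember the usable size */
    memset(block + sizeof(size_t) + size, GUARD_BYTE, GUARD_SIZE);
    return block + sizeof(size_t);
}

/* Free a block from guarded_alloc(). Returns 0 if the guard zone is
 * intact, or -1 if something wrote past the end of the user area. */
static int guarded_free(void *ptr)
{
    unsigned char *user = ptr;
    unsigned char *block = user - sizeof(size_t);
    size_t size;
    int status = 0;

    memcpy(&size, block, sizeof size);
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (user[size + i] != GUARD_BYTE) {
            status = -1;  /* overrun detected, long after it happened */
            break;
        }
    }
    free(block);
    return status;
}
```

With this sketch, a buffer from `guarded_alloc(8)` that has 12 bytes written to it makes `guarded_free()` return -1, even though the faulty write itself went unnoticed. A real allocator keeps its own bookkeeping next to user data in much the same way, so an overrun tends to surface only when a later malloc() or free() reads the damaged bookkeeping.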

On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:

> Hi,
>
> While debugging a mysterious crash of a code, I was able to trace it down
> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>
> (gdb) c
> Continuing.
>
> Program received signal SIGSEGV, Segmentation fault.
> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
> at malloc.c:4385
> 4385 nextsize = chunksize(nextchunk);
> (gdb) l
> 4380 Consolidate other non-mmapped chunks as they arrive.
> 4381 */
> 4382
> 4383 else if (!chunk_is_mmapped(p)) {
> 4384 nextchunk = chunk_at_offset(p, size);
> 4385 nextsize = chunksize(nextchunk);
> 4386 assert(nextsize > 0);
> 4387
> 4388 /* consolidate backward */
> 4389 if (!prev_inuse(p)) {
> (gdb) bt
> #0 opal_memory_ptmalloc2_int_free (av=0x2fd0637,
> mem=0x203a746f74512000) at malloc.c:4385
> #1 0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
> at malloc.c:3511
> #2 0x00002ae6b18ea736 in opal_memory_linux_free_hook
> (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
> #3 0x0000000001412fcc in for_dealloc_allocatable ()
> #4 0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
> name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
> ) at alloc.F90:1357
> #5 0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
> na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
> lasto=..., iphorb=...,
> numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
> listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
> fa=..., stress=..., h=...,
> first=@0x0, last=@0x0) at ldau.F:752
> #6 0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
> (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
> #7 0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
> (istep=@0xf9a4d07000000000) at siesta_forces.F:90
> #8 0x000000000070e475 in siesta () at siesta.F:23
> #9 0x000000000045e47c in main ()
>
> Can anybody shed some light here on what could be wrong?
>
> Thanks,
>
> Yong Qin

-- 
Jeff Squyres
jsquyres_at_[hidden]