
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] SIGSEGV in OMPI 1.6.x
From: Yong Qin (yong.qin_at_[hidden])
Date: 2012-09-06 12:52:03


Thanks Jeff. I will definitely do the failure analysis, but I just
wanted to confirm that this isn't something specific to OMPI itself,
e.g., a missing configuration setting.

On Thu, Sep 6, 2012 at 5:01 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> If you run into a segv in this code, it almost certainly means that you have heap corruption somewhere. FWIW, that has *always* been what it meant when I've run into segv's in any code under opal/mca/memory/linux/. Meaning: my user code did something wrong, it created heap corruption, and then later some malloc() or free() caused a segv in this area of the code.
>
> This code is the same ptmalloc memory allocator that has shipped in glibc for years. I'd be hard-pressed to say that any code is 100% bug free :-), but I'd be surprised if there is a bug in this particular chunk of code.
>
> I'd run your code through valgrind or some other memory-checking debugger and see if that can shed any light on what's going on.
>
>
> On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:
>
>> Hi,
>>
>> While debugging a mysterious crash of a code, I was able to trace down
>> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
>> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>>
>> (gdb) c
>> Continuing.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
>> at malloc.c:4385
>> 4385 nextsize = chunksize(nextchunk);
>> (gdb) l
>> 4380 Consolidate other non-mmapped chunks as they arrive.
>> 4381 */
>> 4382
>> 4383 else if (!chunk_is_mmapped(p)) {
>> 4384 nextchunk = chunk_at_offset(p, size);
>> 4385 nextsize = chunksize(nextchunk);
>> 4386 assert(nextsize > 0);
>> 4387
>> 4388 /* consolidate backward */
>> 4389 if (!prev_inuse(p)) {
>> (gdb) bt
>> #0 opal_memory_ptmalloc2_int_free (av=0x2fd0637,
>> mem=0x203a746f74512000) at malloc.c:4385
>> #1 0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
>> at malloc.c:3511
>> #2 0x00002ae6b18ea736 in opal_memory_linux_free_hook
>> (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
>> #3 0x0000000001412fcc in for_dealloc_allocatable ()
>> #4 0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
>> name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
>> ) at alloc.F90:1357
>> #5 0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
>> na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
>> lasto=..., iphorb=...,
>> numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
>> listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
>> fa=..., stress=..., h=...,
>> first=@0x0, last=@0x0) at ldau.F:752
>> #6 0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
>> (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
>> #7 0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
>> (istep=@0xf9a4d07000000000) at siesta_forces.F:90
>> #8 0x000000000070e475 in siesta () at siesta.F:23
>> #9 0x000000000045e47c in main ()
>>
>> Can anybody shed some light here on what could be wrong?
>>
>> Thanks,
>>
>> Yong Qin
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>