Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segfault when combining OpenMPI and GotoBLAS2
From: Dorian Krause (doriankrause_at_[hidden])
Date: 2010-01-19 20:39:01


Hi,

@Gus: I didn't use any flags when building the installed OpenMPI
version. In fact, for this mail I used an OpenMPI version that I just
installed with the --enable-debug flag.

From what I can tell by stepping through with the debugger, the problem
happens in btl_openib_component_init:

#0 btl_openib_component_init (num_btl_modules=0x7fff7a8593e8, enable_progress_threads=false, enable_mpi_threads=false)
     at /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/openib/btl_openib_component.c:2099
#1 0x00002b9eb6f65679 in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
     at /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/base/btl_base_select.c:110
#2 0x00002aaad007d933 in mca_bml_r2_component_init (priority=0x7fff7a8594b4, enable_progress_threads=false, enable_mpi_threads=false)
     at /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/r2/bml_r2_component.c:86
#3 0x00002b9eb6f64a80 in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
     at /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/base/bml_base_init.c:69
#4 0x00002aaacfc5580a in mca_pml_ob1_component_init (priority=0x7fff7a8595d0, enable_progress_threads=false, enable_mpi_threads=false)
     at /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/ob1/pml_ob1_component.c:168
#5 0x00002b9eb6f787a4 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
     at /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/base/pml_base_select.c:126
#6 0x00002b9eb6ef4989 in ompi_mpi_init (argc=1, argv=0x7fff7a859af8, requested=0, provided=0x7fff7a859858)
     at /home/kraused/ompi/openmpi-1.4/ompi/runtime/ompi_mpi_init.c:534
#7 0x00002b9eb6f33bb2 in PMPI_Init (argc=0x7fff7a8598cc, argv=0x7fff7a8598c0)
     at /home/kraused/ompi/openmpi-1.4/ompi/mpi/c/profile/pinit.c:80
#8 0x00000000004007e6 in main (argc=1, argv=0x7fff7a859af8)
     at /home/kraused/blas.c:20

When I set a breakpoint in btl_openib_component_init and continue from
there, I get a SIGILL, but the backtrace is meaningless to me:

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 0x40901940 (LWP 21183)]
0x00007fff23b2a7c0 in ?? ()
(gdb) bt
#0 0x00007fff23b2a7c0 in ?? ()
#1 0x0000003df9c06307 in start_thread () from /lib64/libpthread.so.0
#2 0x0000003df90d1ded in clone () from /lib64/libc.so.6
#3 0x0000000000000000 in ?? ()

The bad thing is: if I step through btl_openib_component_init until
right after the call to ompi_btl_openib_fd_init and continue from
there, the program finishes.

More precisely: once I have stepped past the pthread_create call at
line 537 in btl_openib_fd.c, I can continue and the program runs to
completion. I conjecture that gdb influences the threading here and
that this is why the problem doesn't show up?!

I'm interested in digging further, but I need some advice/hints on
where to go from here.

Thanks,
Dorian

On 1/19/10 1:29 PM, Jeff Squyres wrote:
> Can you get a core dump, or otherwise see exactly where the seg fault is occurring?
>
> On Jan 18, 2010, at 8:34 AM, Dorian Krause wrote:
>
>
>> Hi Eloi,
>>
>>> Do the segmentation faults you're facing also happen in a sequential
>>> environment (i.e. not linked against openmpi libraries)?
>>>
>> No, without MPI everything works fine. Also, linking against mvapich
>> doesn't give any errors. I think there is a problem with GotoBLAS and
>> the shared library infrastructure of OpenMPI. The code never even
>> reaches the point where the gemm operation is executed.
>>
>>
>>> Have you already informed Kazushige Goto (developer of Gotoblas) ?
>>>
>> Not yet. Since the problem only happens with openmpi and the BLAS
>> (stand-alone) seems to work, I thought the openmpi mailing list would be
>> the better place to discuss this (to get a grasp of what the error could
>> be before going to the GotoBLAS mailing list).
>>
>>
>>> Regards,
>>> Eloi
>>>
>>> PS: Could you post your Makefile.rule here so that we could check the
>>> different compilation options chosen ?
>>>
>> I didn't make any changes to Makefile.rule. This is the content of
>> Makefile.conf:
>>
>> OSNAME=Linux
>> ARCH=x86_64
>> C_COMPILER=GCC
>> BINARY32=
>> BINARY64=1
>> CEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
>> -L/lib/../lib64 -L/usr/lib/../lib64 -lc
>> F_COMPILER=GFORTRAN
>> FC=gfortran
>> BU=_
>> FEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
>> -L/lib/../lib64 -L/usr/lib/../lib64 -lgfortran -lm -lgfortran -lm -lc
>> CORE=BARCELONA
>> LIBCORE=barcelona
>> NUM_CORES=8
>> HAVE_MMX=1
>> HAVE_SSE=1
>> HAVE_SSE2=1
>> HAVE_SSE3=1
>> HAVE_SSE4A=1
>> HAVE_3DNOWEX=1
>> HAVE_3DNOW=1
>> MAKE += -j 8
>> SGEMM_UNROLL_M=8
>> SGEMM_UNROLL_N=4
>> DGEMM_UNROLL_M=4
>> DGEMM_UNROLL_N=4
>> QGEMM_UNROLL_M=2
>> QGEMM_UNROLL_N=2
>> CGEMM_UNROLL_M=4
>> CGEMM_UNROLL_N=2
>> ZGEMM_UNROLL_M=2
>> ZGEMM_UNROLL_N=2
>> XGEMM_UNROLL_M=1
>> XGEMM_UNROLL_N=1
>>
>>
>> Thanks,
>> Dorian
>>
>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>