
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] segfault when combining OpenMPI and GotoBLAS2
From: Eloi Gaudry (eg_at_[hidden])
Date: 2010-01-20 11:38:39


Hi,

FYI, this issue is solved with the latest version of the library
(v2-1.11), at least on my side.

Eloi

Gus Correa wrote:
> Hi Dorian
>
> Dorian Krause wrote:
>> Hi,
>>
>> @Gus: I don't use any special flags for the installed Open MPI version.
>> In fact, for this mail I used an Open MPI version freshly built with
>> just the --enable-debug flag.
>>
>
> You are right, my guess was wrong.
> It's just that some people have reported trouble with threaded
> versions of Open MPI, specifically when running HPL on top of Goto BLAS.
>
> FYI, my local HPL was built with (non-threaded) OpenMPI 1.3.2 and
> Goto BLAS 1.26, which now I see is kind of ancient
> (but still seems to be the default at TACC).
> I linked other programs to that Goto BLAS also.
> It works fine and fast, but this doesn't help you much.
>
> Have you tried other programs linked against your current Open MPI
> that do not use Goto BLAS, just to ensure that the problem
> is not in Open MPI?
> (Say, the basic programs in the examples directory: connectivity_c.c,
> ring_c.c, hello_c.c.)
>
> Anyway, hopefully Kazushige Goto will find out what is going on.
> Good luck!
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
>> From what I can tell from stepping through in the debugger, the
>> problem happens in btl_openib_component_init:
>>
>> #0 btl_openib_component_init (num_btl_modules=0x7fff7a8593e8,
>> enable_progress_threads=false, enable_mpi_threads=false)
>> at
>> /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/openib/btl_openib_component.c:2099
>>
>> #1 0x00002b9eb6f65679 in mca_btl_base_select
>> (enable_progress_threads=false, enable_mpi_threads=false)
>> at
>> /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/base/btl_base_select.c:110
>> #2 0x00002aaad007d933 in mca_bml_r2_component_init
>> (priority=0x7fff7a8594b4, enable_progress_threads=false,
>> enable_mpi_threads=false)
>> at
>> /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/r2/bml_r2_component.c:86
>> #3 0x00002b9eb6f64a80 in mca_bml_base_init
>> (enable_progress_threads=false, enable_mpi_threads=false)
>> at
>> /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/base/bml_base_init.c:69
>> #4 0x00002aaacfc5580a in mca_pml_ob1_component_init
>> (priority=0x7fff7a8595d0, enable_progress_threads=false,
>> enable_mpi_threads=false)
>> at
>> /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/ob1/pml_ob1_component.c:168
>> #5 0x00002b9eb6f787a4 in mca_pml_base_select
>> (enable_progress_threads=false, enable_mpi_threads=false)
>> at
>> /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/base/pml_base_select.c:126
>> #6 0x00002b9eb6ef4989 in ompi_mpi_init (argc=1, argv=0x7fff7a859af8,
>> requested=0, provided=0x7fff7a859858)
>> at /home/kraused/ompi/openmpi-1.4/ompi/runtime/ompi_mpi_init.c:534
>> #7 0x00002b9eb6f33bb2 in PMPI_Init (argc=0x7fff7a8598cc,
>> argv=0x7fff7a8598c0) at
>> /home/kraused/ompi/openmpi-1.4/ompi/mpi/c/profile/pinit.c:80
>> #8 0x00000000004007e6 in main (argc=1, argv=0x7fff7a859af8) at
>> /home/kraused/blas.c:20
>>
>> When I set a breakpoint in btl_openib_component_init and continue
>> from there, I get a SIGILL, but the backtrace is meaningless to me:
>>
>> Program received signal SIGILL, Illegal instruction.
>> [Switching to Thread 0x40901940 (LWP 21183)]
>> 0x00007fff23b2a7c0 in ?? ()
>> (gdb) bt
>> #0 0x00007fff23b2a7c0 in ?? ()
>> #1 0x0000003df9c06307 in start_thread () from /lib64/libpthread.so.0
>> #2 0x0000003df90d1ded in clone () from /lib64/libc.so.6
>> #3 0x0000000000000000 in ?? ()
>>
>>
>> The bad thing is: if I step through btl_openib_component_init right
>> after the call to ompi_btl_openib_fd_init and continue from there, the
>> program finishes.
>>
>> More precisely: if I step over the pthread_create call at line 537
>> in btl_openib_fd.c, I can continue afterwards without a crash.
>> I conjecture that gdb influences the threading here and that this is
>> why the problem doesn't show up?!
>>
>> I'm interested in digging further, but I need some advice/hints on
>> where to go from here.
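The breakpoint-and-continue sequence described above can be collected in a gdb command file so the run is reproducible without manual stepping. A sketch: btl_openib_component_init is the function from the backtrace, while the binary name blas_test and the file name gdb.cmds are placeholders.

```
# gdb.cmds -- run with: gdb -x gdb.cmds ./blas_test
break btl_openib_component_init
run
# At the breakpoint, inspect the state before resuming:
#   (gdb) info threads
#   (gdb) bt
continue
# gdb stops again when the SIGILL is raised; then disassemble
# around the faulting address to see what was executed:
#   (gdb) x/4i $pc
```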
>>
>> Thanks,
>> Dorian
>>
>>
>> On 1/19/10 1:29 PM, Jeff Squyres wrote:
>>> Can you get a core dump, or otherwise see exactly where the seg
>>> fault is occurring?
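A minimal sketch of how to capture one with standard shell settings (the binary name below is a placeholder):

```shell
# Allow core files to be written; children of this shell,
# including mpirun, inherit the limit.
ulimit -c unlimited
# Confirm the new limit (should print "unlimited").
ulimit -c
# After the crash, open the core file to find the faulting frame, e.g.:
#   gdb ./blas_test core
#   (gdb) bt
```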
>>>
>>> On Jan 18, 2010, at 8:34 AM, Dorian Krause wrote:
>>>
>>>
>>>> Hi Eloi,
>>>>
>>>>> Do the segmentation faults you're facing also happen in a
>>>>> sequential environment (i.e., not linked against the Open MPI
>>>>> libraries)?
>>>>>
>>>> No, without MPI everything works fine. Also, linking against MVAPICH
>>>> doesn't give any errors. I think there is a problem with GotoBLAS and
>>>> the shared library infrastructure of Open MPI. The code doesn't even
>>>> reach the gemm operation.
>>>>
>>>>
>>>>> Have you already informed Kazushige Goto (the developer of GotoBLAS)?
>>>>>
>>>> Not yet. Since the problem only happens with Open MPI and the BLAS
>>>> (stand-alone) seems to work, I thought the Open MPI mailing list
>>>> would be the better place to discuss this (to get a grasp of what
>>>> the error could be before going to the GotoBLAS mailing list).
>>>>
>>>>
>>>>> Regards,
>>>>> Eloi
>>>>>
>>>>> PS: Could you post your Makefile.rule here so that we can check the
>>>>> different compilation options chosen?
>>>>>
>>>> I didn't make any changes to Makefile.rule. This is the content of
>>>> Makefile.conf:
>>>>
>>>> OSNAME=Linux
>>>> ARCH=x86_64
>>>> C_COMPILER=GCC
>>>> BINARY32=
>>>> BINARY64=1
>>>> CEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
>>>> -L/lib/../lib64 -L/usr/lib/../lib64 -lc
>>>> F_COMPILER=GFORTRAN
>>>> FC=gfortran
>>>> BU=_
>>>> FEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
>>>> -L/lib/../lib64 -L/usr/lib/../lib64 -lgfortran -lm -lgfortran -lm -lc
>>>> CORE=BARCELONA
>>>> LIBCORE=barcelona
>>>> NUM_CORES=8
>>>> HAVE_MMX=1
>>>> HAVE_SSE=1
>>>> HAVE_SSE2=1
>>>> HAVE_SSE3=1
>>>> HAVE_SSE4A=1
>>>> HAVE_3DNOWEX=1
>>>> HAVE_3DNOW=1
>>>> MAKE += -j 8
>>>> SGEMM_UNROLL_M=8
>>>> SGEMM_UNROLL_N=4
>>>> DGEMM_UNROLL_M=4
>>>> DGEMM_UNROLL_N=4
>>>> QGEMM_UNROLL_M=2
>>>> QGEMM_UNROLL_N=2
>>>> CGEMM_UNROLL_M=4
>>>> CGEMM_UNROLL_N=2
>>>> ZGEMM_UNROLL_M=2
>>>> ZGEMM_UNROLL_N=2
>>>> XGEMM_UNROLL_M=1
>>>> XGEMM_UNROLL_N=1
>>>>
>>>>
>>>> Thanks,
>>>> Dorian