
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] segfault when combining OpenMPI and GotoBLAS2
From: Gus Correa (gus_at_[hidden])
Date: 2010-01-20 11:15:26


Hi Dorian

Dorian Krause wrote:
> Hi,
>
> @Gus I don't use any flags for the installed OpenMPI version. In fact
> for this mail I used an OpenMPI version just installed with the
> --enable-debug flag.
>

You are right, my guess was wrong.
It is just that some people have reported trouble with threaded
versions of OpenMPI, specifically when running HPL on top of Goto BLAS.

FYI, my local HPL was built with (non-threaded) OpenMPI 1.3.2 and
Goto BLAS 1.26, which now I see is kind of ancient
(but still seems to be the default at TACC).
I linked other programs to that Goto BLAS also.
It works fine and fast, but this doesn't help you much.

Have you tried other programs linked with your current OpenMPI
that do not use Goto BLAS, just to ensure that the problem
is not in OpenMPI itself?
(Say, the basic programs in the examples directory: connectivity_c.c,
ring_c.c, hello_c.c.)

Anyway, hopefully Kazushige Goto will find out what is going on.
Good luck!

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

> From what I can tell from stepping through the debugger the problem
> happens in btl_openib_component_init:
>
> #0 btl_openib_component_init (num_btl_modules=0x7fff7a8593e8,
> enable_progress_threads=false, enable_mpi_threads=false)
> at
> /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/openib/btl_openib_component.c:2099
>
> #1 0x00002b9eb6f65679 in mca_btl_base_select
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/base/btl_base_select.c:110
> #2 0x00002aaad007d933 in mca_bml_r2_component_init
> (priority=0x7fff7a8594b4, enable_progress_threads=false,
> enable_mpi_threads=false)
> at /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/r2/bml_r2_component.c:86
> #3 0x00002b9eb6f64a80 in mca_bml_base_init
> (enable_progress_threads=false, enable_mpi_threads=false)
> at /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/base/bml_base_init.c:69
> #4 0x00002aaacfc5580a in mca_pml_ob1_component_init
> (priority=0x7fff7a8595d0, enable_progress_threads=false,
> enable_mpi_threads=false)
> at
> /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/ob1/pml_ob1_component.c:168
> #5 0x00002b9eb6f787a4 in mca_pml_base_select
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/base/pml_base_select.c:126
> #6 0x00002b9eb6ef4989 in ompi_mpi_init (argc=1, argv=0x7fff7a859af8,
> requested=0, provided=0x7fff7a859858)
> at /home/kraused/ompi/openmpi-1.4/ompi/runtime/ompi_mpi_init.c:534
> #7 0x00002b9eb6f33bb2 in PMPI_Init (argc=0x7fff7a8598cc,
> argv=0x7fff7a8598c0) at
> /home/kraused/ompi/openmpi-1.4/ompi/mpi/c/profile/pinit.c:80
> #8 0x00000000004007e6 in main (argc=1, argv=0x7fff7a859af8) at
> /home/kraused/blas.c:20
>
> When I set a breakpoint in btl_openib_component_init and continue from
> there I get a SIGILL but the backtrace is meaningless to me:
>
> Program received signal SIGILL, Illegal instruction.
> [Switching to Thread 0x40901940 (LWP 21183)]
> 0x00007fff23b2a7c0 in ?? ()
> (gdb) bt
> #0 0x00007fff23b2a7c0 in ?? ()
> #1 0x0000003df9c06307 in start_thread () from /lib64/libpthread.so.0
> #2 0x0000003df90d1ded in clone () from /lib64/libc.so.6
> #3 0x0000000000000000 in ?? ()
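
[Editor's note: when a newly created thread dies with SIGILL at an address
that resolves to `?? ()`, the usual suspicion is a corrupted start-routine
pointer or thread stack. A few standard gdb commands (not specific to
Open MPI) can make a backtrace like the one above more informative:

```
(gdb) handle SIGILL stop print     # ensure gdb stops when the signal arrives
(gdb) info threads                 # identify which thread took the SIGILL
(gdb) x/4i $pc                     # disassemble at the faulting address
(gdb) info proc mappings           # check whether $pc lies in any mapped library
(gdb) thread apply all bt          # collect backtraces from every thread
```

If `$pc` falls outside every mapping listed by `info proc mappings`, the
thread jumped through a bad pointer rather than executing a genuinely
illegal instruction in valid code.]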
>
>
> The bad thing is: If I step through btl_openib_component_init right
> after the call to ompi_btl_openib_fd_init and continue from there the
> program finishes.
>
> More precisely: after stepping beyond the pthread_create call at line
> 537 in btl_openib_fd.c, I can continue to completion.
> I suspect that gdb alters the thread timing here, which is why the
> problem doesn't show up.
>
> I'm interested in digging further, but I need some advice/hints on
> where to go from here.
>
> Thanks,
> Dorian
>
>
> On 1/19/10 1:29 PM, Jeff Squyres wrote:
>> Can you get a core dump, or otherwise see exactly where the seg fault
>> is occurring?
>>
>> On Jan 18, 2010, at 8:34 AM, Dorian Krause wrote:
>>
>>
>>> Hi Eloi,
>>>
>>>> Do the segmentation faults you're facing also happen in a sequential
>>>> environment (i.e. not linked against the openmpi libraries)?
>>>>
>>> No, without MPI everything works fine. Also, linking against mvapich
>>> doesn't give any errors. I think there is a problem between GotoBLAS
>>> and the shared library infrastructure of OpenMPI. The code doesn't
>>> even reach the gemm operation.
>>>
>>>
>>>> Have you already informed Kazushige Goto (developer of Gotoblas) ?
>>>>
>>> Not yet. Since the problem only happens with openmpi and the BLAS
>>> (stand-alone) seems to work, I thought the openmpi mailing list would be
>>> the better place to discuss this (to get a grasp of what the error could
>>> be before going to the GotoBLAS mailing list).
>>>
>>>
>>>> Regards,
>>>> Eloi
>>>>
>>>> PS: Could you post your Makefile.rule here so that we could check the
>>>> different compilation options chosen ?
>>>>
>>> I didn't make any changes to Makefile.rule. This is the content of
>>> Makefile.conf:
>>>
>>> OSNAME=Linux
>>> ARCH=x86_64
>>> C_COMPILER=GCC
>>> BINARY32=
>>> BINARY64=1
>>> CEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
>>> -L/lib/../lib64 -L/usr/lib/../lib64 -lc
>>> F_COMPILER=GFORTRAN
>>> FC=gfortran
>>> BU=_
>>> FEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
>>> -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
>>> -L/lib/../lib64 -L/usr/lib/../lib64 -lgfortran -lm -lgfortran -lm -lc
>>> CORE=BARCELONA
>>> LIBCORE=barcelona
>>> NUM_CORES=8
>>> HAVE_MMX=1
>>> HAVE_SSE=1
>>> HAVE_SSE2=1
>>> HAVE_SSE3=1
>>> HAVE_SSE4A=1
>>> HAVE_3DNOWEX=1
>>> HAVE_3DNOW=1
>>> MAKE += -j 8
>>> SGEMM_UNROLL_M=8
>>> SGEMM_UNROLL_N=4
>>> DGEMM_UNROLL_M=4
>>> DGEMM_UNROLL_N=4
>>> QGEMM_UNROLL_M=2
>>> QGEMM_UNROLL_N=2
>>> CGEMM_UNROLL_M=4
>>> CGEMM_UNROLL_N=2
>>> ZGEMM_UNROLL_M=2
>>> ZGEMM_UNROLL_N=2
>>> XGEMM_UNROLL_M=1
>>> XGEMM_UNROLL_N=1
>>>
>>>
>>> Thanks,
>>> Dorian
>>>
>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>
>>>
>>
>>
>