Subject: Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
From: Joshua Baker-LePain (jlb17_at_[hidden])
Date: 2012-03-13 19:22:36

On Tue, 13 Mar 2012 at 10:57pm, Gutierrez, Samuel K wrote:

> Fooey. What compiler are you using to build Open MPI and how are you
> configuring your build?

I'm using gcc as packaged by RH/CentOS 6.2:

[jlb_at_opt200 1.4.5-2]$ gcc --version
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)

I actually tried two custom builds of Open MPI 1.4.5. For the first, I
stuck close to the options in RH's compat-openmpi SRPM:

./configure --prefix=$HOME/ompi-1.4.5 --enable-mpi-threads \
    --enable-openib-ibcm --with-sge --with-libltdl=external \
    --with-valgrind --enable-memchecker --with-psm=no --with-esmtp \
    LDFLAGS='-Wl,-z,noexecstack'

That resulted in the backtrace I sent previously:
#0 0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#1 0x00002b00967737ca in opal_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#2 0x00002b00975ef8d5 in barrier ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#3 0x00002b009628da24 in ompi_mpi_init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#4 0x00002b00962b24f0 in PMPI_Init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#5 0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
     at mpihello-long.c:11

For kicks, I tried a second build of 1.4.5 with a bare minimum of options:

./configure --prefix=$HOME/ompi-1.4.5 --with-sge

That resulted in a slightly different backtrace, in which the top frame
is unresolved (??) and a signal handler frame appears:
#0 0x00002b7bbc8681d0 in ?? ()
#1 <signal handler called>
#2 0x00002b7bbd2b8f6c in mca_btl_sm_component_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#3 0x00002b7bb9b2feda in opal_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#4 0x00002b7bba9a98d5 in barrier ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#5 0x00002b7bb965d426 in ompi_mpi_init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#6 0x00002b7bb967cba0 in PMPI_Init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#7 0x0000000000400826 in main (argc=1, argv=0x7fff93634788)
     at mpihello-long.c:11

> Can you also run with a debug build of Open MPI
> so we can see the line numbers?

I'll do that first thing tomorrow.
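
For the record, the debug build will probably look something like the
following. This is only a sketch: it assumes --enable-debug is the
configure switch that turns on debugging symbols and extra checks, and the
separate "-debug" prefix is a hypothetical choice so that the existing
install is left alone.

```shell
# Sketch of a debug build, run from the Open MPI 1.4.5 source tree.
# Assumption: --enable-debug is the right switch for line numbers in backtraces.
# The "-debug" prefix is hypothetical; it keeps the current install untouched.
./configure --prefix=$HOME/ompi-1.4.5-debug --with-sge --enable-debug
make all install
```

The test program would then need to be rebuilt with the mpicc from that
prefix before rerunning under SGE.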

>>> Another question. How reproducible is this on your system?
>> In my testing today, it's been 100% reproducible.
> That's surprising.

Heh. You're telling me.

Thanks for taking an interest in this.

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin