Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
From: Joshua Baker-LePain (jlb17_at_[hidden])
Date: 2012-03-13 19:22:36

On Tue, 13 Mar 2012 at 10:57pm, Gutierrez, Samuel K wrote

> Fooey. What compiler are you using to build Open MPI and how are you
> configuring your build?

I'm using gcc as packaged by RH/CentOS 6.2:

[jlb@opt200 1.4.5-2]$ gcc --version
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)

I actually tried 2 custom builds of Open MPI 1.4.5. For the first I tried
to stick close to the options in RH's compat-openmpi SRPM:

./configure --prefix=$HOME/ompi-1.4.5 --enable-mpi-threads --enable-openib-ibcm --with-sge --with-libltdl=external --with-valgrind --enable-memchecker --with-psm=no --with-esmtp LDFLAGS='-Wl,-z,noexecstack'

That resulted in the backtrace I sent previously:
#0 0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#1 0x00002b00967737ca in opal_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#2 0x00002b00975ef8d5 in barrier ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#3 0x00002b009628da24 in ompi_mpi_init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#4 0x00002b00962b24f0 in PMPI_Init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#5 0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
     at mpihello-long.c:11

For kicks, I tried a 2nd compile of 1.4.5 with a bare minimum of options:

./configure --prefix=$HOME/ompi-1.4.5 --with-sge

That resulted in a slightly different backtrace that seems to be missing
a bit:
#0 0x00002b7bbc8681d0 in ?? ()
#1 <signal handler called>
#2 0x00002b7bbd2b8f6c in mca_btl_sm_component_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#3 0x00002b7bb9b2feda in opal_progress ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#4 0x00002b7bba9a98d5 in barrier ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/
#5 0x00002b7bb965d426 in ompi_mpi_init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#6 0x00002b7bb967cba0 in PMPI_Init ()
    from /netapp/sali/jlb/ompi-1.4.5/lib/
#7 0x0000000000400826 in main (argc=1, argv=0x7fff93634788)
     at mpihello-long.c:11

> Can you also run with a debug build of Open MPI
> so we can see the line numbers?

I'll do that first thing tomorrow.
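
For reference, a debug build along the lines requested might be configured roughly as below. This is only a sketch: it assumes the same 1.4.5 source tree and the minimal option set from the second build above. The separate install prefix ($HOME/ompi-1.4.5-debug) is a hypothetical name chosen so the debug build does not overwrite the existing one. The CFLAGS setting is likewise an assumption, added to keep symbols and avoid optimized-out frames.

```shell
# Sketch of a debug build; the prefix below is hypothetical, not from the thread.
# --enable-debug turns on Open MPI's debugging code paths;
# CFLAGS is an assumed addition so gdb can report source lines.
./configure --prefix=$HOME/ompi-1.4.5-debug --with-sge --enable-debug CFLAGS="-g -O0"
make all install
```

With the test program relinked against that install, the backtrace frames inside the Open MPI libraries should carry file names and line numbers instead of bare addresses.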

>>> Another question. How reproducible is this on your system?
>> In my testing today, it's been 100% reproducible.
> That's surprising.

Heh. You're telling me.

Thanks for taking an interest in this.

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin