
Subject: Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
From: Gutierrez, Samuel K (samuel_at_[hidden])
Date: 2012-03-13 15:53:48

The failure signature isn't exactly what we were seeing here at LANL, but there were misplaced memory barriers in Open MPI 1.4.3; ticket 2619 tracks that issue. It doesn't explain, however, the failures that you are experiencing with Open MPI 1.5.4. Can you give 1.4.4 a whirl and see if that fixes the issue? Any additional information about your failures in 1.5.4 would be greatly appreciated.
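
For anyone following along, the bug class here is the usual single-producer/single-consumer ordering problem in a shared-memory FIFO. The sketch below is illustrative only, not the actual btl_sm code; all names are made up, overrun checks are omitted, and __sync_synchronize() stands in for the lighter opal_atomic_wmb()/opal_atomic_rmb() macros Open MPI actually uses:

/*
 * Minimal sketch (NOT the actual Open MPI source) of why the
 * barriers matter in an sm-BTL-style FIFO.  Names are hypothetical.
 */
#include <stdint.h>

#define FIFO_FREE ((void *)0)

typedef struct {
    volatile void *queue[256];   /* slots shared between two processes */
    volatile uint32_t head;      /* producer index */
    volatile uint32_t tail;      /* consumer index */
} sketch_fifo_t;

/* Producer: fill the fragment, THEN publish the pointer. */
static void sketch_fifo_write(sketch_fifo_t *fifo, void *frag)
{
    /* ... fill *frag with the message payload ... */

    /* Without a write barrier here, a weakly ordered CPU (or the
     * compiler) may make the queue-slot store visible before the
     * payload stores, so the consumer dereferences a garbage
     * fragment. */
    __sync_synchronize();        /* write barrier */

    fifo->queue[fifo->head++ % 256] = frag;
}

/* Consumer: see the pointer, THEN read the payload. */
static void *sketch_fifo_read(sketch_fifo_t *fifo)
{
    void *frag = (void *)fifo->queue[fifo->tail % 256];
    if (frag == FIFO_FREE)
        return FIFO_FREE;

    /* A matching read barrier keeps the payload loads from being
     * hoisted above the pointer load. */
    __sync_synchronize();        /* read barrier */

    fifo->queue[fifo->tail++ % 256] = FIFO_FREE;
    return frag;
}

A misplaced barrier (on the wrong side of the publishing store, or missing entirely on one end) produces exactly this kind of intermittent segfault inside the FIFO read path.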


Samuel K. Gutierrez
Los Alamos National Laboratory
On Mar 13, 2012, at 1:35 PM, Joshua Baker-LePain wrote:

> On Tue, 13 Mar 2012 at 7:20pm, Gutierrez, Samuel K wrote:
>> Just to be clear, what specific version of Open MPI produced the provided backtrace? This smells like a missing memory barrier problem.
>
> The backtrace in my original post was from 1.5.4 -- I took the 1.5.4 source and put it into the 1.5.3 SRPM provided by Red Hat. Below is a backtrace from 1.4.3 as shipped by RH/CentOS:
> #0  sm_fifo_read () at btl_sm.h:267
> #1  mca_btl_sm_component_progress () at btl_sm_component.c:391
> #2  0x0000003e54a129ca in opal_progress () at runtime/opal_progress.c:207
> #3  0x00002b00fa6bb8d5 in barrier () at grpcomm_bad_module.c:270
> #4  0x0000003e55e37d04 in ompi_mpi_init (argc=<value optimized out>,
>     argv=<value optimized out>, requested=<value optimized out>,
>     provided=<value optimized out>) at runtime/ompi_mpi_init.c:722
> #5  0x0000003e55e5bae0 in PMPI_Init (argc=0x7fff8588b1cc, argv=0x7fff8588b1c0)
>     at pinit.c:80
> #6  0x0000000000400826 in main (argc=1, argv=0x7fff8588b2c8)
>     at mpihello-long.c:11
>
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
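
For context, the backtrace pins the crash inside MPI_Init (frame #5, called from mpihello-long.c:11), during the shared-memory barrier that runs as part of startup. The reproducer therefore appears to be a plain MPI hello-world; a minimal sketch along those lines (hypothetical, reconstructed only from the frames above, not the poster's actual mpihello-long.c) is:

/* Hypothetical reconstruction of the reproducer; the only thing the
 * backtrace pins down is that MPI_Init is the crash site. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* crash site per the traces */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

That the crash happens before any user-level communication is consistent with a race in the sm BTL's startup FIFO traffic rather than in the application itself.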