The failure signature isn't exactly what we were seeing here at LANL, but Open MPI 1.4.3 did have misplaced memory barriers; see ticket 2619 (https://svn.open-mpi.org/trac/ompi/ticket/2619).  This doesn't explain, however, the failures that you are experiencing with Open MPI 1.5.4.  Can you give 1.4.4 a whirl and see if it fixes the issue?  Any more information about your failures in 1.5.4 would be greatly appreciated.
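For anyone following along who wonders what a misplaced or missing barrier means for a shared-memory FIFO, here is a minimal sketch. It is not the Open MPI sm BTL code: the names spsc_fifo, fifo_write, fifo_read, and fifo_smoke_test are hypothetical, and it assumes a single producer, a single consumer, and a C11 compiler. The point is only that the writer must publish the new tail after the payload store (release), and the reader must observe the tail before reading the payload (acquire).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SLOTS 8  /* hypothetical capacity, chosen for illustration */

/* Hypothetical single-producer/single-consumer ring buffer. */
typedef struct {
    int slots[FIFO_SLOTS];
    atomic_size_t head;  /* next slot to read; advanced only by the reader */
    atomic_size_t tail;  /* next slot to write; advanced only by the writer */
} spsc_fifo;

static void fifo_init(spsc_fifo *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Writer side: store the payload first, then publish the new tail. */
static bool fifo_write(spsc_fifo *q, int value)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == FIFO_SLOTS) {
        return false;  /* full */
    }
    q->slots[tail % FIFO_SLOTS] = value;
    /* Release: the payload store above may not be reordered after this
     * store. Without it, a reader could see the new tail but stale data. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Reader side: observe the tail first, then read the payload. */
static bool fifo_read(spsc_fifo *q, int *out)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    /* Acquire: pairs with the writer's release store on tail. */
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) {
        return false;  /* empty */
    }
    *out = q->slots[head % FIFO_SLOTS];
    /* Release: hand the slot back only after the payload has been read. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Single-threaded check of the FIFO logic only; it cannot expose an
 * ordering bug, which needs two threads or processes to show up. */
int fifo_smoke_test(void)
{
    spsc_fifo q;
    int sum = 0;
    int v;

    fifo_init(&q);
    for (int i = 1; i <= FIFO_SLOTS; i++) {
        assert(fifo_write(&q, i));
    }
    assert(!fifo_write(&q, 99));  /* queue is full at this point */
    while (fifo_read(&q, &v)) {
        sum += v;
    }
    return sum;  /* sum of 1..8 */
}
```

If the release/acquire pair is dropped or placed on the wrong side of the payload access, the reader can act on a slot that has not been fully written yet, which is the general kind of failure a missing barrier produces on weakly ordered hardware or under compiler reordering.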

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 13, 2012, at 1:35 PM, Joshua Baker-LePain wrote:

On Tue, 13 Mar 2012 at 7:20pm, Gutierrez, Samuel K wrote:

Just to be clear, what specific version of Open MPI produced the provided backtrace?  This smells like a missing memory barrier problem.

The backtrace in my original post was from 1.5.4 -- I took the 1.5.4 source and put it into the 1.5.3 SRPM provided by Red Hat.  Below is a backtrace from 1.4.3 as shipped by RH/CentOS:

#0  sm_fifo_read () at btl_sm.h:267
#1  mca_btl_sm_component_progress () at btl_sm_component.c:391
#2  0x0000003e54a129ca in opal_progress () at runtime/opal_progress.c:207
#3  0x00002b00fa6bb8d5 in barrier () at grpcomm_bad_module.c:270
#4  0x0000003e55e37d04 in ompi_mpi_init (argc=<value optimized out>,
   argv=<value optimized out>, requested=<value optimized out>,
   provided=<value optimized out>) at runtime/ompi_mpi_init.c:722
#5  0x0000003e55e5bae0 in PMPI_Init (argc=0x7fff8588b1cc, argv=0x7fff8588b1c0)
   at pinit.c:80
#6  0x0000000000400826 in main (argc=1, argv=0x7fff8588b2c8)
   at mpihello-long.c:11

Thanks!

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF