
Open MPI Development Mailing List Archives


Subject: [OMPI devel] #1506
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-09-23 11:18:52


Hi George,

There seems to be some data corruption in the Reduce_scatter function.

I discovered it when I added -DCHECK to the IMB benchmark; it seems to have
been there for ages.

It runs fine with Voltaire MPI but fails with OMPI. You will get a segv with
IMB 3.1 and an error with IMB 3.0.

host#VER=TRUNK ; /home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 2 -H
witch8 /home/BENCHMARKS/PALLAS/IMB_3.0v/src/IMB-MPI1_${VER} Reduce_scatter

#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.0v modified by Voltaire, MPI-1 part
#---------------------------------------------------
# Date : Tue Sep 23 18:05:35 2008
# Machine : x86_64
# System : Linux
# Release : 2.6.16.46-0.12-smp
# Version : #1 SMP Thu May 17 14:00:09 UTC 2007
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 67108864
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# Reduce_scatter

#-----------------------------------------------------------------------------
# Benchmarking Reduce_scatter
# #processes = 2
#-----------------------------------------------------------------------------
#Benchmarking #procs #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
defects
Reduce_scatter 2 0 1000 0.05 0.05 0.05 0.00
0: Error Reduce_scatter, size = 4, sample #0
Process 0: Got invalid buffer:
Buffer entry: 817291591680.000000
pos: 0
Process 0: Expected buffer:
Buffer entry: 0.000000
Reduce_scatter 2 4 1000 0.98 1.06 1.02 1.00
Application error code 1 occurred
[witch8:10190] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 17
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 10190 on
node witch8 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------