Open MPI User's Mailing List Archives

Subject: [OMPI users] problems with Intel 12.x compilers and OpenMPI (1.4.3)
From: Paul Kapinos (kapinos_at_[hidden])
Date: 2011-09-23 05:51:14


Hi Open MPI folks,

we see some quite strange effects with our installations of Open MPI
1.4.3 with the Intel 12.x compilers, which puzzle us: different
programs reproducibly deadlock or die with errors like the ones listed
below.

Some of the errors look like programming issues at first glance (well, a
deadlock *is* usually a programming error), but we do not believe that is
the case here: the errors arise in many well-tested codes, including HPL (*),
only with one specific compiler + Open MPI combination (Intel 12.x compilers +
Open MPI 1.4.3), and only at particular process counts (usually high
ones). For example, HPL reproducibly deadlocks with 72 processes and dies
with error message #2 with 384 processes.

All these errors seem to be somehow related to MPI communicators; and
1.4.4rc3, 1.5.3, and 1.5.4 do not seem to have this problem. Also,
1.4.3 used together with the Intel 11.x compiler series seems to be
unproblematic. So probably this:

(1.4.4 release notes:)
- Fixed a segv in MPI_Comm_create when called with GROUP_EMPTY.
   Thanks to Dominik Goeddeke for finding this.

is also the fix for our issue? Or maybe not, because 1.5.3 is _older_ than
this fix?
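For reference, a minimal program exercising that code path (a hypothetical sketch, not the test case from the release notes) would have every rank call MPI_Comm_create with MPI_GROUP_EMPTY:

```c
/* Hypothetical minimal reproducer for the MPI_Comm_create /
 * MPI_GROUP_EMPTY segv mentioned in the 1.4.4 release notes.
 * Build with mpicc and run under mpirun; this is a sketch, not
 * the test case Dominik Goeddeke reported. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);

    /* MPI_Comm_create is collective over MPI_COMM_WORLD. Ranks that
     * are not members of the given group -- here, all of them, since
     * the group is empty -- must receive MPI_COMM_NULL rather than
     * crash. */
    MPI_Comm_create(MPI_COMM_WORLD, MPI_GROUP_EMPTY, &newcomm);

    if (newcomm == MPI_COMM_NULL)
        printf("got MPI_COMM_NULL, as expected\n");

    MPI_Finalize();
    return 0;
}
```

A correct MPI implementation returns MPI_COMM_NULL on every rank here; a segfault at the MPI_Comm_create call would indicate the bug that the 1.4.4 fix addresses.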

Since we worked around the problem by switching our production to
1.5.3, this issue is not a "burning" one; but I still decided to post
it, because any issue in such fundamental things may be interesting for
the developers.

Best wishes,
Paul Kapinos

(*) http://www.netlib.org/benchmark/hpl/

################################################################
Fatal error in MPI_Comm_size: Invalid communicator, error stack:
MPI_Comm_size(111): MPI_Comm_size(comm=0x0, size=0x6f4a90) failed
MPI_Comm_size(69).: Invalid communicator

################################################################
[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** An error occurred in MPI_Comm_split
[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** on communicator MPI
COMMUNICATOR 3 SPLIT FROM 0
[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERR_IN_STATUS: error code
in status
[linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERRORS_ARE_FATAL (your MPI
job will now abort)

################################################################
forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
libmpi.so.0 00002AAAAD9EDF52 Unknown Unknown Unknown
libmpi.so.0 00002AAAAD9EE45D Unknown Unknown Unknown
libmpi.so.0 00002AAAAD9C3375 Unknown Unknown Unknown
libmpi_f77.so.0 00002AAAAD75C37A Unknown Unknown Unknown
vasp_mpi_gamma 000000000057E010 Unknown Unknown Unknown
vasp_mpi_gamma 000000000059F636 Unknown Unknown Unknown
vasp_mpi_gamma 0000000000416C5A Unknown Unknown Unknown
vasp_mpi_gamma 0000000000A62BEE Unknown Unknown Unknown
libc.so.6 0000003EEB61EC5D Unknown Unknown Unknown
vasp_mpi_gamma 0000000000416A29 Unknown Unknown Unknown

-- 
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915