Open MPI User's Mailing List Archives

Subject: [OMPI users] OpenMPI 1.6.4, MPI I/O on Lustre, 32bit: bug?
From: Paul Kapinos (kapinos_at_[hidden])
Date: 2013-03-25 08:46:20


Hello,
we observe the following divide-by-zero error:

[linuxscc005:31416] *** Process received signal ***
[linuxscc005:31416] Signal: Floating point exception (8)
[linuxscc005:31416] Signal code: Integer divide-by-zero (1)
[linuxscc005:31416] Failing at address: 0x2282db
[linuxscc005:31416] [ 0] [0x3a9410]
[linuxscc005:31416] [ 1] /lib/libgcc_s.so.1(__divdi3+0x8b) [0x2282db]
[linuxscc005:31416] [ 2]
/opt/MPI/openmpi-1.6.4/linux/intel/lib/lib32/libmpi.so.1(ADIOI_LUSTRE_WriteStrided+0x1c36)
[0x8c8206]
[linuxscc005:31416] [ 3]
/opt/MPI/openmpi-1.6.4/linux/intel/lib/lib32/libmpi.so.1(MPIOI_File_write+0x1f2)
[0x8ed752]
[linuxscc005:31416] [ 4]
/opt/MPI/openmpi-1.6.4/linux/intel/lib/lib32/libmpi.so.1(mca_io_romio_dist_MPI_File_write+0x33)
[0x8ed553]
[linuxscc005:31416] [ 5]
/opt/MPI/openmpi-1.6.4/linux/intel/lib/lib32/libmpi.so.1(mca_io_romio_file_write+0x2e)
[0x8a46fe]
[linuxscc005:31416] [ 6]
/opt/MPI/openmpi-1.6.4/linux/intel/lib/lib32/libmpi.so.1(MPI_File_write+0x45)
[0x846c25]
[linuxscc005:31416] [ 7]
/rwthfs/rz/cluster/home/pk224850/SVN/rz_cluster_utils/test_suite/trunk/tests/mpi/mpiIO/mpiIOC32.exe()
[0x804a1ac]
[linuxscc005:31416] [ 8] /lib/libc.so.6(__libc_start_main+0xe6) [0x6fccce6]
[linuxscc005:31416] [ 9]
/rwthfs/rz/cluster/home/pk224850/SVN/rz_cluster_utils/test_suite/trunk/tests/mpi/mpiIO/mpiIOC32.exe()
[0x8049d91]
[linuxscc005:31416] *** End of error message ***

... when we use Open MPI 1.6.4 to compile a C test program(*) (attached; a minimal
sketch of the I/O pattern follows the list below), which performs some MPI I/O on Lustre.

0.) The error only occurs if the binary is compiled in 32-bit mode.
1.) The error does not correlate with the compiler used to build the MPI library
(all four we have - GCC, Sun/Oracle Studio, Intel, PGI - show the same behaviour).
2.) The error does not occur with our Open MPI 1.6.1 installation (however, I'm not
really sure the configure options used were the same).
3.) The error only occurs if the file to be written is located on the Lustre
file system (no error on a local disk or on an NFS share).
4.) The Fortran version (also attached) does not have the issue.
5.) The error only occurs when using 2 or more processes.
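
The attached ctest.c is authoritative, but roughly the core of the test does something
like the sketch below. The file name, sizes and the Lustre path are placeholders, and
the strided file view is only my reading of why the backtrace ends up in
ADIOI_LUSTRE_WriteStrided:

#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_File     fh;
    MPI_Datatype filetype;
    MPI_Status   st;
    int          rank, nprocs, i;
    int          buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (i = 0; i < N; i++)
        buf[i] = rank;

    /* interleave the ranks' blocks in the file -> noncontiguous file view,
       so the write goes through ROMIO's *_WriteStrided path */
    MPI_Type_vector(N / 16, 16, 16 * nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* the path is a placeholder; in our tests it lies on a Lustre mount */
    MPI_File_open(MPI_COMM_WORLD, "/lustre/work/mpiio_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * 16 * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* with 2+ processes and a 32-bit binary, this call dies with SIGFPE
       inside ADIOI_LUSTRE_WriteStrided */
    MPI_File_write(fh, buf, N, MPI_INT, &st);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}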

Based on the error message, I believe the problem is located somewhere deep inside
the Open MPI/ROMIO implementation...
Is somebody interested in investigating this issue further? If so, we can provide
more information. Otherwise we will probably just ignore it...
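
For what it's worth: __divdi3 in frame [ 1] is the libgcc helper that 32-bit x86 code
has to call for 64-bit integer division, so apparently some 64-bit quantity used as a
divisor inside ADIOI_LUSTRE_WriteStrided (my guess would be a stripe or chunk size
obtained from the file system) is zero there. A trivial illustration - not the actual
ROMIO code - of why such a crash surfaces in libgcc only for -m32 builds:

/* div64.c -- illustration only */
#include <stdio.h>

int main(int argc, char **argv)
{
    long long len   = 1 << 20;   /* stands in for an I/O request length        */
    long long chunk = argc - 1;  /* stands in for a stripe/chunk size that
                                    unexpectedly comes back as 0               */

    /* Compiled with -m32, this 64-bit division is routed through the libgcc
       helper __divdi3; a zero divisor then raises SIGFPE with "Integer
       divide-by-zero", i.e. the frames [ 0]/[ 1] seen in the backtrace above.
       A 64-bit build divides inline instead.                                  */
    printf("%lld\n", len / chunk);
    return 0;
}

Built with something like "gcc -O0 -m32 div64.c" and run without arguments, this
should die with the same "Floating point exception" signal.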

Best
Paul Kapinos

(*) We have a kind of internal test suite for checking our MPI installations...

P.S. $ mpicc -O0 -m32 -o ./mpiIOC32.exe ctest.c -lm
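
The failing case is then launched in the usual way, e.g. with two processes and the
output file on a Lustre mount (the exact path handling is in the attached sources):

$ mpiexec -np 2 ./mpiIOC32.exe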

P.S.2: An example configure line:

./configure --with-openib --with-lsf --with-devel-headers
--enable-contrib-no-build=vt --enable-heterogeneous --enable-cxx-exceptions
--enable-orterun-prefix-by-default --disable-dlopen --disable-mca-dso
--with-io-romio-flags='--with-file-system=testfs+ufs+nfs+lustre'
--enable-mpi-ext CFLAGS="$FLAGS_FAST $FLAGS_ARCH32 " CXXFLAGS="$FLAGS_FAST
$FLAGS_ARCH32 " FFLAGS="$FLAGS_FAST $FLAGS_ARCH32 " FCFLAGS="$FLAGS_FAST
$FLAGS_ARCH32 " LDFLAGS="$FLAGS_FAST $FLAGS_ARCH32
-L/opt/lsf/8.0/linux2.6-glibc2.3-x86/lib"
--prefix=/opt/MPI/openmpi-1.6.4/linux/gcc
--mandir=/opt/MPI/openmpi-1.6.4/linux/gcc/man
--bindir=/opt/MPI/openmpi-1.6.4/linux/gcc/bin/32
--libdir=/opt/MPI/openmpi-1.6.4/linux/gcc/lib/lib32
--includedir=/opt/MPI/openmpi-1.6.4/linux/gcc/include/32
--datarootdir=/opt/MPI/openmpi-1.6.4/linux/gcc/share/32 2>&1 | tee log_01_conf.txt

I _believe_ the part
--with-io-romio-flags='--with-file-system=testfs+ufs+nfs+lustre'
is new in our 1.6.4 installation compared to 1.6.1. Could this be the root of the
evil?
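
If it helps to narrow this down: as far as I know, ROMIO lets one force a particular
ADIO driver via a prefix on the file name, so one quick cross-check would be to open
the very same file on Lustre through the generic UFS driver instead of the Lustre one
(prefix syntax as in the ROMIO users' guide; the path is again a placeholder):

    MPI_File_open(MPI_COMM_WORLD, "ufs:/lustre/work/mpiio_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

If that variant runs cleanly while the plain path crashes, the Lustre-specific write
path would be the prime suspect.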

-- 
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915