Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] bug in MPI_ACCUMULATE for window offsets > 2**31 - 1 bytes? openmpi v1.2.5
From: Stefan Knecht (stefan_at_[hidden])
Date: 2008-02-11 08:05:46


Hi Tim,

Many thanks for the fix!
Everything works fine now with the current trunk version.

Best regards,

stefan

| Tim Prins wrote:
| The fix I previously sent to the list has been committed in r17400.
|
| Thanks,
|
| Tim

| Tim Prins wrote:
| Hi Stefan,
|
| I was able to verify the problem. It turns out that this affects other
| one-sided operations as well. Attached is a simple test case I wrote in C
| using MPI_Put that also fails.
|
| The problem is that the target count and displacement are both sent as
| signed 32-bit integers. The receiver then multiplies them together and
| adds the result to the window base. However, the multiplication is done
| in signed 32-bit arithmetic, which overflows. The overflowed value is then
| added to the 64-bit pointer, which of course results in a bad address.
|
| I have attached a patch against a recent development version that fixes
| this for me. I am also copying Brian Barrett, who did all the work on
| the one-sided code.
|
| Brian: if possible, please take a look at the attached patch and test case.
|
| Thanks for the report!
|
| Tim Prins
|
| Stefan Knecht wrote:
| Hi all,
|
| I have encountered a problem with the routine MPI_ACCUMULATE when summing
| MPI_REAL8 values into a large memory window at a large offset.
| My program (running on a single processor, x86_64 architecture) crashes
| with an error message like:
|
| [node14:16236] *** Process received signal ***
| [node14:16236] Signal: Segmentation fault (11)
| [node14:16236] Signal code: Address not mapped (1)
| [node14:16236] Failing at address: 0x2aaa32b16000
| [node14:16236] [ 0] /lib64/libpthread.so.0 [0x32e080de00]
| [node14:16236] [ 1] /home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(ompi_mpi_op_sum_double+0x10) [0x2aaaaaf15530]
| [node14:16236] [ 2] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_process_op+0x2d7) [0x2aaab1a47257]
| [node14:16236] [ 3] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so [0x2aaab1a45432]
| [node14:16236] [ 4] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_passive_unlock+0x93) [0x2aaab1a48243]
| [node14:16236] [ 5] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so [0x2aaab1a43436]
| [node14:16236] [ 6] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_progress+0xff) [0x2aaab1a42e0f]
| [node14:16236] [ 7] /home/stefan/bin/openmpi-1.2.5/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2aaaab3dfa0a]
| [node14:16236] [ 8] /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_module_unlock+0x2a9) [0x2aaab1a48629]
| [node14:16236] [ 9] /home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(PMPI_Win_unlock+0xe1) [0x2aaaaaf4a291]
| [node14:16236] [10] /home/stefan/bin/openmpi-1.2.5/lib/libmpi_f77.so.0(mpi_win_unlock_+0x25) [0x2aaaaacdd8c5]
| [node14:16236] [11] /home/stefan/calc/mpi2_test/a.out(MAIN__+0x809) [0x401851]
| [node14:16236] [12] /home/stefan/calc/mpi2_test/a.out(main+0xe) [0x401bbe]
| [node14:16236] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dfc1dab4]
| [node14:16236] [14] /home/stefan/calc/mpi2_test/a.out [0x400f99]
| [node14:16236] *** End of error message ***
| mpirun noticed that job rank 0 with PID 16236 on node node14 exited on signal 11 (Segmentation fault).
|
|
| The relevant part of my FORTRAN source code reads as:
|
|       program accumulate_test
|       IMPLICIT REAL*8 (A-H,O-Z)
|       include 'mpif.h'
|       INTEGER(KIND=MPI_OFFSET_KIND) MX_SIZE_M
| C dummy size parameter
|       PARAMETER (MX_SIZE_M = 1 000 000)
|       INTEGER MPIerr, MYID, NPROC
|       INTEGER ITARGET, MY_X_WIN, JCOUNT, JCOUNT_T
|       INTEGER(KIND=MPI_ADDRESS_KIND) MEM_X, MEM_Y
|       INTEGER(KIND=MPI_ADDRESS_KIND) IDISPL_WIN
|       INTEGER(KIND=MPI_ADDRESS_KIND) PTR1, PTR2
|       INTEGER(KIND=MPI_INTEGER_KIND) ISIZE_REAL8
|       INTEGER*8 NELEMENT_X, NELEMENT_Y
|       POINTER (PTR1, XMAT(MX_SIZE_M))
|       POINTER (PTR2, YMAT(MX_SIZE_M))
| C
|       CALL MPI_INIT( MPIerr )
|       CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYID, MPIerr)
|       CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, MPIerr)
| C
|       NELEMENT_X = 400 000 000
|       NELEMENT_Y = 10 000
| C
|       CALL MPI_TYPE_EXTENT(MPI_REAL8, ISIZE_REAL8, MPIerr)
|       MEM_X = NELEMENT_X * ISIZE_REAL8
|       MEM_Y = NELEMENT_Y * ISIZE_REAL8
| C
| C allocate memory
| C
|       CALL MPI_ALLOC_MEM( MEM_X, MPI_INFO_NULL, PTR1, MPIerr)
|       CALL MPI_ALLOC_MEM( MEM_Y, MPI_INFO_NULL, PTR2, MPIerr)
| C
| C fill vectors with 0.0D0 and 1.0D0
| C
|       CALL DZERO(XMAT,NELEMENT_X)
|       CALL DONE(YMAT,NELEMENT_Y)
| C
| C open memory window
| C
|       CALL MPI_WIN_CREATE( XMAT, MEM_X, ISIZE_REAL8,
|      &                     MPI_INFO_NULL, MPI_COMM_WORLD,
|      &                     MY_X_WIN, MPIerr )
| C lock window (MPI_LOCK_SHARED mode)
| C select target ==> if itarget == myid: no 1-sided communication
| C
|       ITARGET = MYID
|       CALL MPI_WIN_LOCK( MPI_LOCK_SHARED, ITARGET, MPI_MODE_NOCHECK,
|      &                   MY_X_WIN, MPIerr)
| C
| C transfer data to target ITARGET
| C
|       JCOUNT_T = 10 000
|       JCOUNT = JCOUNT_T
| C set displacement in memory window
|       IDISPL_WIN = 300 000 000
| C
|       CALL MPI_ACCUMULATE( YMAT, JCOUNT, MPI_REAL8, ITARGET, IDISPL_WIN,
|      &                     JCOUNT_T, MPI_REAL8, MPI_SUM, MY_X_WIN, MPIerr)
| C
| C unlock
| C
|       CALL MPI_WIN_UNLOCK( ITARGET, MY_X_WIN, MPIerr)
| ...
|
| The complete source code (accumulate_test.F) is attached to this e-mail,
| as is the config.log of my Open MPI installation.
|
| The program only(!) fails for values of IDISPL_WIN > 268 435 455!!!
| For all lower offset values it finishes normally.
|
| Therefore, I assume that after the internal multiplication (in
| MPI_ACCUMULATE) of IDISPL_WIN with the window scaling factor ISIZE_REAL8
| (== 8 bytes) an INTEGER*4 overflow occurs, although IDISPL_WIN is declared
| as KIND=MPI_ADDRESS_KIND (INTEGER*8). Might that be the reason?
|
| Running this program with MPI_GET instead of MPI_ACCUMULATE and
| the same offsets is no problem at all.
|
| Thanks in advance for any help,
|
| stefan
|
| --
| -------------------------------------------
| Dipl. Chem. Stefan Knecht
| Institute for Theoretical and
| Computational Chemistry
| Heinrich-Heine University Düsseldorf
| Universitätsstraße 1
| Building 26.32 Room 03.33
| 40225 Düsseldorf
|
| phone: +49-(0)211-81-11439
| e-mail: stefan_at_[hidden]
| http://www.theochem.uni-duesseldorf.de/users/stefan
|
|