Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] bug in MPI_ACCUMULATE for window offsets > 2**31 - 1 bytes? openmpi v1.2.5
From: Tim Prins (tprins_at_[hidden])
Date: 2008-02-07 13:47:15


The fix I previously sent to the list has been committed in r17400.

Thanks,

Tim

Tim Prins wrote:
> Hi Stefan,
>
> I was able to verify the problem. It turns out this is a problem with other
> one-sided operations as well. Attached is a simple test case, written in C
> using MPI_Put, that also fails.
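>
> The attached file itself is not reproduced in this archive; purely as an
> illustration of the kind of test meant above (the sizes, the self-targeting,
> and the lack of error checking are my own simplifications), such a test
> might look roughly like the following:
>
> /* Sketch of an MPI_Put with a byte offset beyond 2**31 - 1. Illustration
>  * only, not the attached test case; needs about 3.2 GB of memory. */
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Aint nelem = 400000000;       /* doubles in the window          */
>     MPI_Aint disp  = 300000000;       /* element offset; * 8 > 2**31    */
>     double  *base, src[10];
>     MPI_Win  win;
>     int      i, myid;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &myid);
>     MPI_Alloc_mem(nelem * (MPI_Aint)sizeof(double), MPI_INFO_NULL, &base);
>     for (i = 0; i < 10; i++) src[i] = 1.0;
>
>     /* displacement unit of sizeof(double), as in the Fortran example */
>     MPI_Win_create(base, nelem * (MPI_Aint)sizeof(double), sizeof(double),
>                    MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>     /* each rank targets itself, so no real communication is required */
>     MPI_Win_lock(MPI_LOCK_SHARED, myid, MPI_MODE_NOCHECK, win);
>     MPI_Put(src, 10, MPI_DOUBLE, myid, disp, 10, MPI_DOUBLE, win);
>     MPI_Win_unlock(myid, win);
>
>     MPI_Win_free(&win);
>     MPI_Free_mem(base);
>     MPI_Finalize();
>     return 0;
> }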
>
> The problem is that the target count and displacement are both sent as
> signed 32-bit integers. The receiver then multiplies them together and adds
> the product to the window base. However, the multiplication is done in
> signed 32-bit arithmetic and overflows before the result is added to the
> 64-bit pointer. This, of course, results in a bad address.
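>
> To make that concrete, here is a small sketch (stand-in names, not the
> actual Open MPI source) of the arithmetic that goes wrong and the cast that
> avoids it:
>
> /* Sketch only: stand-in names, not Open MPI's internal code. */
> #include <stdio.h>
> #include <stdint.h>
>
> int main(void)
> {
>     int32_t target_disp = 300000000;  /* displacement, sent as signed 32-bit */
>     int32_t disp_unit   = 8;          /* window scaling factor (REAL*8)      */
>
>     /* Buggy: the product (2 400 000 000) is computed in 32-bit arithmetic
>      * and overflows before it is widened to 64 bits (strictly speaking
>      * undefined behaviour; in practice it wraps to a negative value).    */
>     int64_t bad_offset  = target_disp * disp_unit;
>
>     /* Fixed: widen one operand first so the multiply happens in 64 bits. */
>     int64_t good_offset = (int64_t)target_disp * disp_unit;
>
>     printf("bad offset:  %lld\n", (long long)bad_offset);
>     printf("good offset: %lld\n", (long long)good_offset);
>     return 0;
> }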
>
> I have attached a patch against a recent development version that fixes
> this for me. I am also copying Brian Barrett, who did all the work on
> the one-sided code.
>
> Brian: if possible, please take a look at the attached patch and test case.
>
> Thanks for the report!
>
> Tim Prins
>
> Stefan Knecht wrote:
>>
>> Hi all,
>>
>> I am encountering a problem with the routine MPI_ACCUMULATE when trying to
>> sum up MPI_REAL8 values on a large memory window with a large offset.
>> My program, running on a single processor (x86_64 architecture), crashes
>> with an error message like:
>>
>> [node14:16236] *** Process received signal ***
>> [node14:16236] Signal: Segmentation fault (11)
>> [node14:16236] Signal code: Address not mapped (1)
>> [node14:16236] Failing at address: 0x2aaa32b16000
>> [node14:16236] [ 0] /lib64/libpthread.so.0 [0x32e080de00]
>> [node14:16236] [ 1]
>> /home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(ompi_mpi_op_sum_double+0x10)
>> [0x2aaaaaf15530]
>> [node14:16236] [ 2]
>> /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_process_op+0x2d7)
>>
>> [0x2aaab1a47257]
>> [node14:16236] [ 3]
>> /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so
>> [0x2aaab1a45432]
>> [node14:16236] [ 4]
>> /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_passive_unlock+0x93)
>>
>> [0x2aaab1a48243]
>> [node14:16236] [ 5]
>> /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so
>> [0x2aaab1a43436]
>> [node14:16236] [ 6]
>> /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_progress+0xff)
>>
>> [0x2aaab1a42e0f]
>> [node14:16236] [ 7]
>> /home/stefan/bin/openmpi-1.2.5/lib/libopen-pal.so.0(opal_progress+0x4a)
>> [0x2aaaab3dfa0a]
>> [node14:16236] [ 8]
>> /home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_module_unlock+0x2a9)
>>
>> [0x2aaab1a48629]
>> [node14:16236] [ 9]
>> /home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(PMPI_Win_unlock+0xe1)
>> [0x2aaaaaf4a291]
>> [node14:16236] [10]
>> /home/stefan/bin/openmpi-1.2.5/lib/libmpi_f77.so.0(mpi_win_unlock_+0x25)
>> [0x2aaaaacdd8c5]
>> [node14:16236] [11] /home/stefan/calc/mpi2_test/a.out(MAIN__+0x809)
>> [0x401851]
>> [node14:16236] [12] /home/stefan/calc/mpi2_test/a.out(main+0xe)
>> [0x401bbe]
>> [node14:16236] [13] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x32dfc1dab4]
>> [node14:16236] [14] /home/stefan/calc/mpi2_test/a.out [0x400f99]
>> [node14:16236] *** End of error message ***
>> mpirun noticed that job rank 0 with PID 16236 on node node14 exited on
>> signal 11 (Segmentation fault).
>>
>>
>> The relevant part of my FORTRAN source code reads as:
>>
>>       program accumulate_test
>>       IMPLICIT REAL*8 (A-H,O-Z)
>>       include 'mpif.h'
>>       INTEGER(KIND=MPI_OFFSET_KIND) MX_SIZE_M
>> C     dummy size parameter
>>       PARAMETER (MX_SIZE_M = 1 000 000)
>>       INTEGER MPIerr, MYID, NPROC
>>       INTEGER ITARGET, MY_X_WIN, JCOUNT, JCOUNT_T
>>       INTEGER(KIND=MPI_ADDRESS_KIND) MEM_X, MEM_Y
>>       INTEGER(KIND=MPI_ADDRESS_KIND) IDISPL_WIN
>>       INTEGER(KIND=MPI_ADDRESS_KIND) PTR1, PTR2
>>       INTEGER(KIND=MPI_INTEGER_KIND) ISIZE_REAL8
>>       INTEGER*8 NELEMENT_X, NELEMENT_Y
>>       POINTER (PTR1, XMAT(MX_SIZE_M))
>>       POINTER (PTR2, YMAT(MX_SIZE_M))
>> C
>>       CALL MPI_INIT( MPIerr )
>>       CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYID, MPIerr)
>>       CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, MPIerr)
>> C
>>       NELEMENT_X = 400 000 000
>>       NELEMENT_Y = 10 000
>> C
>>       CALL MPI_TYPE_EXTENT(MPI_REAL8, ISIZE_REAL8, MPIerr)
>>       MEM_X = NELEMENT_X * ISIZE_REAL8
>>       MEM_Y = NELEMENT_Y * ISIZE_REAL8
>> C
>> C     allocate memory
>> C
>>       CALL MPI_ALLOC_MEM( MEM_X, MPI_INFO_NULL, PTR1, MPIerr)
>>       CALL MPI_ALLOC_MEM( MEM_Y, MPI_INFO_NULL, PTR2, MPIerr)
>> C
>> C     fill vectors with 0.0D0 and 1.0D0
>> C
>>       CALL DZERO(XMAT,NELEMENT_X)
>>       CALL DONE(YMAT,NELEMENT_Y)
>> C
>> C     open memory window
>> C
>>       CALL MPI_WIN_CREATE( XMAT, MEM_X, ISIZE_REAL8,
>>      &                     MPI_INFO_NULL, MPI_COMM_WORLD,
>>      &                     MY_X_WIN, MPIerr )
>> C     lock window (MPI_LOCK_SHARED mode)
>> C     select target ==> if itarget == myid: no 1-sided communication
>> C
>>       ITARGET = MYID
>>       CALL MPI_WIN_LOCK( MPI_LOCK_SHARED, ITARGET, MPI_MODE_NOCHECK,
>>      &                   MY_X_WIN, MPIerr)
>> C
>> C     transfer data to target ITARGET
>> C
>>       JCOUNT_T = 10 000
>>       JCOUNT   = JCOUNT_T
>> C     set displacement in memory window
>>       IDISPL_WIN = 300 000 000
>> C
>>       CALL MPI_ACCUMULATE( YMAT, JCOUNT, MPI_REAL8, ITARGET, IDISPL_WIN,
>>      &                     JCOUNT_T, MPI_REAL8, MPI_SUM, MY_X_WIN, MPIerr)
>> C
>> C     unlock
>> C
>>       CALL MPI_WIN_UNLOCK( ITARGET, MY_X_WIN, MPIerr)
>> ...
>>
>> The complete source code (accumulate_test.F) is attached to this e-mail,
>> as well as the config.log of my Open MPI installation.
>>
>> The program fails only(!) for values of IDISPL_WIN > 268 435 455; for all
>> lower offset values it finishes normally.
>>
>> Therefore, I assume that the internal multiplication (inside MPI_ACCUMULATE)
>> of IDISPL_WIN by the window scaling factor ISIZE_REAL8 (== 8 bytes) causes an
>> INTEGER*4 overflow, even though IDISPL_WIN is declared with
>> KIND=MPI_ADDRESS_KIND (INTEGER*8). Might that be the reason?
>>
>> Running the same program with MPI_GET instead of MPI_ACCUMULATE, using the
>> same offsets, works without any problem.
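>>
>> As a quick sanity check of that threshold (sketched in C here only for
>> brevity; the variable names are mine): 268 435 455 * 8 = 2 147 483 640 still
>> fits into an INTEGER*4, while 268 435 456 * 8 = 2**31 does not.
>>
>> /* Where does scaling the displacement by 8 bytes leave INTEGER*4 range? */
>> #include <stdio.h>
>> #include <stdint.h>
>>
>> int main(void)
>> {
>>     int64_t last_good = (int64_t)268435455 * 8;  /* 2 147 483 640          */
>>     int64_t first_bad = (int64_t)268435456 * 8;  /* 2 147 483 648 = 2**31  */
>>
>>     printf("%lld fits in a signed 32-bit integer: %s\n",
>>            (long long)last_good, last_good <= INT32_MAX ? "yes" : "no");
>>     printf("%lld fits in a signed 32-bit integer: %s\n",
>>            (long long)first_bad, first_bad <= INT32_MAX ? "yes" : "no");
>>     return 0;
>> }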
>>
>> Thanks in advance for any help,
>>
>> stefan
>>
>> --
>> -------------------------------------------
>> Dipl. Chem. Stefan Knecht
>> Institute for Theoretical and
>> Computational Chemistry
>> Heinrich-Heine University Düsseldorf
>> Universitätsstraße 1
>> Building 26.32 Room 03.33
>> 40225 Düsseldorf
>>
>> phone: +49-(0)211-81-11439
>> e-mail: stefan_at_[hidden]
>> http://www.theochem.uni-duesseldorf.de/users/stefan
>>
>>