
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI_Alltoall with Vector Datatype
From: Spenser Gilliland (spenser_at_[hidden])
Date: 2014-05-08 16:02:52


Matthieu & George,

Thank you both for helping me. I really appreciate it.

> A simple test would be to run it with valgrind, so that out of bound
> reads and writes will be obvious.

I ran it through valgrind (I left the command line I used in the gist
so you can verify the method).

I am getting errors with valgrind inside the Alltoall function. See
https://gist.github.com/anonymous/fbd83343f456f0688cea .

These errors do not occur in the stack-allocated version. See
https://gist.github.com/anonymous/f4dbcddbbc9fee0f508e . I assume
this is because valgrind's default Memcheck tool does not
bounds-check accesses to stack arrays the way it tracks heap
allocations, so stack corruption goes unreported.

> The segfault indicates that you overwrite outside of the allocated memory (and conflicts with the ptmalloc
> library). I’m quite certain that you write outside the allocated array …

So, my understanding is that Alltoall would write wsize blocks of the
datatype's size into the matrix, jumping extent bytes between blocks.
Given that, I'm very lost as to why the Alltoall is exceeding the
bounds of my array.

Thanks,
Spenser

On Thu, May 8, 2014 at 2:19 PM, Matthieu Brucher
<matthieu.brucher_at_[hidden]> wrote:
> A simple test would be to run it with valgrind, so that out of bound
> reads and writes will be obvious.
>
> Cheers,
>
> Matthieu
>
> 2014-05-08 21:16 GMT+02:00 Spenser Gilliland <spenser_at_[hidden]>:
>> George & Matthieu,
>>
>>> The Alltoall should only return when all data is sent and received on
>>> the current rank, so there shouldn't be any race condition.
>>
>> You're right, this is MPI, not pthreads. That should never happen. Duh!
>>
>>> I think the issue is with the way you define the send and receive
>>> buffer in the MPI_Alltoall. You have to keep in mind that the
>>> all-to-all pattern will overwrite the entire data in the receive
>>> buffer. Thus, starting from a relative displacement in the data (in
>>> this case matrix[wrank*wrows]), begs for troubles, as you will write
>>> outside the receive buffer.
>>
>> The submatrix corresponding to matrix[wrank*wrows][0] to
>> matrix[(wrank+1)*wrows-1][:] is valid only on the wrank process. This
>> is a block distribution of the rows like what MPI_Scatter would
>> produce. As wrows is equal to N (matrix width/height) divided by
>> wsize, the number of mpi_all_t blocks in each message is equal to
>> wsize. Therefore, there should be no writing outside the bounds of
>> the submatrix.
>>
>> On another note,
>> I just ported the example to use dynamic memory and now I'm getting
>> segfaults when I call MPI_Finalize(). Any idea what in the code could
>> have caused this?
>>
>> It's paste binned here: https://gist.github.com/anonymous/a80e0679c3cbffb82e39
>>
>> The result is
>>
>> [sgillila_at_jarvis src]$ mpirun -npernode 2 transpose2 8
>> N = 8
>> Matrix =
>> 0: 0 1 2 3 4 5 6 7
>> 0: 8 9 10 11 12 13 14 15
>> 0: 16 17 18 19 20 21 22 23
>> 0: 24 25 26 27 28 29 30 31
>> 1: 32 33 34 35 36 37 38 39
>> 1: 40 41 42 43 44 45 46 47
>> 1: 48 49 50 51 52 53 54 55
>> 1: 56 57 58 59 60 61 62 63
>> Matrix =
>> 0: 0 8 16 24 32 40 48 56
>> 0: 1 9 17 25 33 41 49 57
>> 0: 2 10 18 26 34 42 50 58
>> 0: 3 11 19 27 35 43 51 59
>> 1: 4 12 20 28 36 44 52 60
>> 1: 5 13 21 29 37 45 53 61
>> 1: 6 14 22 30 38 46 54 62
>> 1: 7 15 23 31 39 47 55 63
>> [jarvis:09314] *** Process received signal ***
>> [jarvis:09314] Signal: Segmentation fault (11)
>> [jarvis:09314] Signal code: Address not mapped (1)
>> [jarvis:09314] Failing at address: 0x21da228
>> [jarvis:09314] [ 0] /lib64/libpthread.so.0() [0x371480f500]
>> [jarvis:09314] [ 1]
>> /opt/openmpi/lib/libmpi.so.1(opal_memory_ptmalloc2_int_free+0x75)
>> [0x7f2e85452575]
>> [jarvis:09314] [ 2]
>> /opt/openmpi/lib/libmpi.so.1(opal_memory_ptmalloc2_free+0xd3)
>> [0x7f2e85452bc3]
>> [jarvis:09314] [ 3] transpose2(main+0x160) [0x4012a0]
>> [jarvis:09314] [ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3713c1ecdd]
>> [jarvis:09314] [ 5] transpose2() [0x400d49]
>> [jarvis:09314] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 9314 on node
>> jarvis.cs.iit.edu exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> --
>> Spenser Gilliland
>> Computer Engineer
>> Doctoral Candidate
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
> Music band: http://liliejay.com/
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Spenser Gilliland
Computer Engineer
Doctoral Candidate