Many thanks for trans-coding to C; this was a major help in debugging the issue.
Thankfully, it turned out to be a simple bug. OMPI's parameter checking for MPI_ALLGATHERV was using the *local* group size when checking the recvcounts parameter, where it really should have been using the *remote* group size. So when the local group size > the remote group size, Bad Things could happen.
For this test, the bad case would only happen with odd numbers of processes. It probably fails only intermittently because the check reads past the end of the recvcounts array, and the contents of memory there are undefined: sometimes they'll be ok, sometimes they won't.
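To make the failure mode concrete, here is a minimal sketch in plain C (no MPI needed). It is not the actual OMPI source: check_recvcounts, buggy_check, fixed_check and the 4-versus-3 group sizes are all hypothetical names and values chosen for illustration. The array is deliberately padded with one invalid entry that stands in for whatever happens to sit in memory after a real recvcounts array, so the over-long loop can be shown without actually reading out of bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the recvcounts parameter check.
 * Returns 0 if the first n entries are non-negative, -1 otherwise. */
int check_recvcounts(const int *recvcounts, int n)
{
    assert(recvcounts != NULL);
    for (int i = 0; i < n; ++i) {
        if (recvcounts[i] < 0) {
            return -1;  /* would be reported as an invalid count */
        }
    }
    return 0;
}

/* Assumed sizes for illustration: local group larger than remote group. */
enum { LOCAL_SIZE = 4, REMOTE_SIZE = 3 };

/* recvcounts needs one entry per *remote* process (3 here).
 * The 4th entry is a sentinel standing in for undefined memory
 * that follows the real array. */
const int padded_counts[LOCAL_SIZE] = { 1, 1, 1, -1 };

/* Loops over the local group size: reads one entry too many. */
int buggy_check(void)
{
    return check_recvcounts(padded_counts, LOCAL_SIZE);
}

/* Loops over the remote group size: reads only the valid entries. */
int fixed_check(void)
{
    return check_recvcounts(padded_counts, REMOTE_SIZE);
}
```

With these assumed values, buggy_check() rejects the call because it inspects the sentinel, while fixed_check() accepts it. That matches the observation quoted below that padding rcounts with an extra valid value hides the problem and padding it with an invalid one triggers it.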
I fixed the issue in https://svn.open-mpi.org/trac/ompi/changeset/26488 and filed to move it to 1.6.1 in https://svn.open-mpi.org/trac/ompi/ticket/3105.
Many thanks for reporting the issue!
On May 23, 2012, at 10:30 PM, Jonathan Dursi wrote:
> On 23 May 9:37PM, Jonathan Dursi wrote:
>> On the other hand, it works everywhere if I pad the rcounts array with
>> an extra valid value (0 or 1, or for that matter 783), or replace the
>> allgatherv with an allgather.
> .. and it fails with 7 even where it worked (but succeeds with 8) if I pad rcounts with an extra invalid value which should never be read.
> Should the recvcounts parameter test in allgatherv.c loop up to size=ompi_comm_remote_size(comm), as is done in alltoallv.c, rather than ompi_comm_size(comm)? That seems to avoid the problem.
> - Jonathan
> Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca