
Subject: [OMPI devel] Bug report: non-blocking allreduce with user-defined operation gives segfault
From: Rupert Nash (rupert.nash_at_[hidden])
Date: 2014-04-23 12:52:08


Hello devel list,

I've been trying to use a non-blocking MPI_Iallreduce in a CFD application I'm working on, but it kept segfaulting on me. I have reduced it to a simple test case; see the gist here for the full code:
        https://gist.github.com/rupertnash/11222282
Build and run with:
        mpicc test.c -o test && mpirun -n 2 ./test
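
In case the gist link goes stale, here is a minimal sketch of the same kind of test case (this is not the exact gist contents; the element-wise sum is just an illustrative user-defined operation):

        #include <mpi.h>
        #include <stdio.h>

        /* Trivial user-defined reduction: element-wise sum of doubles. */
        static void my_sum(void *in, void *inout, int *len, MPI_Datatype *dtype)
        {
            double *a = (double *)in;
            double *b = (double *)inout;
            for (int i = 0; i < *len; ++i)
                b[i] += a[i];
        }

        int main(int argc, char *argv[])
        {
            MPI_Init(&argc, &argv);

            MPI_Op op;
            MPI_Op_create(my_sum, 1 /* commutative */, &op);

            double send[4] = {1.0, 2.0, 3.0, 4.0};
            double recv[4];
            MPI_Request req;

            /* The non-blocking collective with a user-defined op is
               what triggers the crash inside libnbc. */
            MPI_Iallreduce(send, recv, 4, MPI_DOUBLE, op, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            printf("recv[0] = %f\n", recv[0]);

            MPI_Op_free(&op);
            MPI_Finalize();
            return 0;
        }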

I am working on OS X Mavericks with Open MPI 1.8 built from the source tarball.

Through some debugging I have narrowed the problem down.
In ompi/mca/coll/libnbc/nbc.c, in NBC_Start_round, the code switches on the type of operation that has been put in the schedule:

      case OP:
        NBC_DEBUG(5, " OP (offset %li) ", (long)ptr-(long)myschedule);
        NBC_GET_BYTES(ptr,opargs);
        NBC_DEBUG(5, "*buf1: %p, buf2: %p, count: %i, type: %lu)\n", opargs.buf1, opargs.buf2, opargs.count, (unsigned long)opargs.datatype);
        /* get buffers */
        /* SNIP */
--->    ompi_3buff_op_reduce(opargs.op, buf1, buf2, buf3, opargs.count, opargs.datatype);
        break;

The line marked with the arrow (--->) is the problem. The comment describing ompi_3buff_op_reduce states: "This function will *only* be invoked on intrinsic MPI_Ops." Examining the code bears this out: it indexes into a table of function pointers, and those entries are all NULL for a user-defined MPI_Op, which is what produces the segfault.
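
To make the failure mode concrete, here is a standalone illustration of that dispatch pattern (all names here are made up; this is not the Open MPI source):

        /* Made-up illustration, NOT Open MPI code: intrinsic ops fill a
           per-datatype table of 3-buffer functions; a user-defined op
           leaves every slot NULL. */
        typedef void (*reduce_3buff_fn)(const void *s1, const void *s2,
                                        void *dst, int count);

        struct op_desc {
            reduce_3buff_fn fns[16];   /* all NULL for a user-defined op */
        };

        static void reduce_3buff(struct op_desc *op, int dtype_id,
                                 const void *b1, const void *b2,
                                 void *b3, int n)
        {
            /* For a user-defined op this jumps through a NULL function
               pointer -- hence the segfault. */
            op->fns[dtype_id](b1, b2, b3, n);
        }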

Presumably the fix will be to replace the use of the 3-buffer version with the usual ompi_op_reduce, at least for non-intrinsic operations. I have made a temporary patch by replacing the arrowed line with the following:
        if (0 != (opargs.op->o_flags & OMPI_OP_FLAGS_INTRINSIC)) {
          /* intrinsic op: the 3-buffer function table is populated */
          ompi_3buff_op_reduce(opargs.op, buf1, buf2, buf3, opargs.count, opargs.datatype);
        } else {
          /* user-defined op: fall back to the 2-buffer reduce, which
             invokes the user's function */
          ompi_op_reduce(opargs.op, buf1, buf3, opargs.count, opargs.datatype);
          ompi_op_reduce(opargs.op, buf2, buf3, opargs.count, opargs.datatype);
        }
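
One caveat I noticed with my own workaround: as far as I can tell, ompi_op_reduce(op, source, target, ...) accumulates source into target, so the two calls above also fold whatever was already in buf3 into the result; that only matches the 3-buffer semantics (buf3 = buf1 op buf2) if buf3 happens to alias one of the inputs. A closer emulation might be the following untested sketch (assuming ompi_datatype_copy_content_same_ddt is usable at this point in nbc.c):

        /* Untested sketch: make buf3 a copy of buf2, then accumulate
           buf1 into it, giving buf3 = buf1 (op) buf2 and preserving the
           operand order for non-commutative user ops. */
        ompi_datatype_copy_content_same_ddt(opargs.datatype, opargs.count,
                                            (char *)buf3, (char *)buf2);
        ompi_op_reduce(opargs.op, buf1, buf3, opargs.count, opargs.datatype);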
However, this is the first time I've looked under the hood of Open MPI, so hopefully you can patch it properly soon.

Best wishes,

Rupert

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.