Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Bug report: non-blocking allreduce with user-defined operation gives segfault
From: Rupert Nash (rupert.nash_at_[hidden])
Date: 2014-04-24 08:32:57


Hi George,

Having looked again, you're correct that the two 2buf reductions are wrong. For now, I've updated my patch to nbc.c so that it copies buf1 into buf3 and then does buf3 OP= buf2 (see below).

Patching ompi_3buff_op_reduce to cope with user-defined operations is certainly possible, but I don't fully understand what that would imply for the rest of the codebase (this is the first time I've looked at the internals of Open MPI).

Best,
Rupert

        if (ompi_op_is_intrinsic(opargs.op)) {
          /* This does buf3 = buf1 OP buf2 */
          ompi_3buff_op_reduce(opargs.op, buf1, buf2, buf3, opargs.count, opargs.datatype);
        } else {
          /* Copy buf1 -> buf3 (if necessary), then do buf3 OP= buf2.
           * If the output buffer is the same as the first input, no copy
           * is needed. If it is the same as the second input, the copy can
           * only be skipped when the operator commutes. */
          if (buf1 == buf3) {
            ompi_op_reduce(opargs.op, buf2, buf3, opargs.count, opargs.datatype);
          } else if (buf2 == buf3 && ompi_op_is_commute(opargs.op)) {
            ompi_op_reduce(opargs.op, buf1, buf3, opargs.count, opargs.datatype);
          } else {
            res = NBC_Copy(buf1, opargs.count, opargs.datatype, buf3, opargs.count, opargs.datatype, handle->comm);
            if(res != NBC_OK) { printf("NBC_Copy() failed (code: %i)\n", res); ret=res; goto error; }
            ompi_op_reduce(opargs.op, buf2, buf3, opargs.count, opargs.datatype);
          }
        }

> Rupert,
>
> You are right, the code of the non-blocking reduce is not built with
> user-level ops in mind. However, I'm not sure about your patch. One
> reason is that ompi_3buff is doing target = source1 op source2 while
> ompi_2buf is doing target op= source (notice the op=).
>
> Thus you can't replace ompi_3buff with two ompi_2buff calls, because
> you would basically replace target = source1 op source2 with
> target op= source1 op source2.
>
> Moreover, a much nicer solution would be to directly patch the
> ompi_3buff_op_reduce function in op.h to fall back to a user-defined
> function when necessary.
>
> George.
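
To make the difference between the two forms concrete, here is a minimal sketch using plain integer sums. The functions sum_3buf, sum_2buf and copy_then_reduce are hypothetical stand-ins written for this illustration; they are not the Open MPI functions and do not share their signatures. The sketch assumes the target never aliases the second source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a 3-buffer reduction:
 * target = source1 op source2 (op is integer addition here).
 * Whatever was in target beforehand is overwritten. */
static void sum_3buf(const int *source1, const int *source2,
                     int *target, int count)
{
    for (int i = 0; i < count; i++) {
        target[i] = source1[i] + source2[i];
    }
}

/* Hypothetical stand-in for a 2-buffer reduction:
 * target op= source. The previous contents of target take part
 * in the result. */
static void sum_2buf(const int *source, int *target, int count)
{
    for (int i = 0; i < count; i++) {
        target[i] += source[i];
    }
}

/* Copy-then-reduce: emulate the 3-buffer form with one copy and one
 * 2-buffer reduction. The copy is skipped when target already is
 * source1. Assumption: target does not alias source2. */
static void copy_then_reduce(const int *source1, const int *source2,
                             int *target, int count)
{
    assert(target != source2);
    if (target != source1) {
        memcpy(target, source1, (size_t)count * sizeof *target);
    }
    sum_2buf(source2, target, count);
}
```

With source1 = {1, 2}, source2 = {10, 20} and a target that initially holds {100, 100}, sum_3buf and copy_then_reduce both leave {11, 22} in the target, whereas calling sum_2buf twice (once per source) leaves {111, 122}, because the stale target contents are folded into the result.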

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.