Open MPI User's Mailing List Archives

Subject: [OMPI users] maximum value for count argument
From: Martin Siegert (siegert_at_[hidden])
Date: 2009-11-10 20:19:45


Hi,

I have a problem with sending/receiving large buffers when using
openmpi (version 1.3.3), e.g.,

MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

with count=180000000 (this problem does not appear to be unique to
Allreduce; it occurs with Reduce and Bcast as well, and maybe more).
Initially I thought the maximum value for count would be 2^31-1
because count is an int. However, when using MPICH2 I already get a
segfault when count=2^31/8, so I suspect that they transfer
bytes instead of doubles internally and the count for the number of bytes
wraps around at that value. This I can deal with (it is not nice,
but I can wrap all calls such that as soon as count > 268435456
several calls are made).
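
The wrapping idea described above can be sketched in plain C without any MPI
dependency. This is only an illustration under the assumption that the
MPICH2 limit really comes from a signed int byte count: 2^31/8 = 268435456
doubles is 2^31 bytes, one more than INT_MAX. The names byte_count_fits_int,
chunk_fn, for_each_chunk and sum_counts are hypothetical and appear in
neither library. In a real program the callback would call, e.g.,
MPI_Allreduce on the sub-range sbuf + offset, rbuf + offset.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns 1 if `count` elements of `elem_size` bytes each can be
 * expressed as a byte count in a signed int, 0 otherwise. */
static int byte_count_fits_int(long long count, size_t elem_size)
{
    return count >= 0 &&
           (unsigned long long)count <= (unsigned long long)INT_MAX / elem_size;
}

/* Hypothetical callback: handle `count` elements starting at element
 * index `offset`. Returns 0 on success. In an MPI code this is where
 * the collective would be called on the sub-buffers. */
typedef int (*chunk_fn)(size_t offset, int count, void *ctx);

/* Split `total` elements into pieces of at most `max_chunk` elements and
 * invoke `fn` once per piece. Returns the number of calls made, or -1 if
 * a callback reported failure. */
static int for_each_chunk(size_t total, int max_chunk, chunk_fn fn, void *ctx)
{
    assert(max_chunk > 0);
    int calls = 0;
    size_t offset = 0;
    while (offset < total) {
        size_t remaining = total - offset;
        int count = remaining > (size_t)max_chunk ? max_chunk : (int)remaining;
        if (fn(offset, count, ctx) != 0)
            return -1;
        offset += (size_t)count;
        calls++;
    }
    return calls;
}

/* Stand-in callback for demonstration only: adds up the element counts
 * it is handed, so a caller can check that every element was covered. */
static int sum_counts(size_t offset, int count, void *ctx)
{
    (void)offset;
    *(size_t *)ctx += (size_t)count;
    return 0;
}
```

With max_chunk = 134217728 (2^27 elements, i.e. 2^30 bytes of doubles), a
buffer of 180000000 elements would be handled in two calls. Note that for
reductions such as MPI_SUM, splitting the buffer element-wise does not
change the result, since each element is reduced independently.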

However, with openmpi I just cannot figure out what the largest
permitted value is: in most cases the MPI calls hang for
count > 176763240, but this is not completely reproducible. This
appears to depend on the history, i.e., what other MPI routines
have been called before that.
From looking at the code, as far as I understand, the MPICH2 problem
should not appear for openmpi: the allreduce call is split up into
several calls anyway - see the loop

for (phase = 0; phase < num_phases; phase ++) {
...
}

in coll_tuned_allreduce.c. In fact that loop is executed just fine.
The "hang" occurs when ompi_coll_tuned_sendrecv is called
(line 839 of coll_tuned_allreduce.c). Here is the call of that function:

(gdb) s
ompi_coll_tuned_sendrecv_actual (sendbuf=0x2aab2d539410, scount=90000000,
    sdatatype=0x602530, dest=1, stag=-12, recvbuf=0x2aab02694010,
    rcount=90000000, rdatatype=0x602530, source=1, rtag=-12, comm=0x602730,
    status=0x0) at coll_tuned_util.c:41

and the program just hangs as soon as ompi_request_wait_all (line 55 of
coll_tuned_util.c) is executed.

Any ideas how to fix this?

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert_at_[hidden]
Canada  V5A 1S6