Hi,
I have a problem with sending/receiving large buffers when using
openmpi (version 1.3.3), e.g.,
MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
with count=180000000. (This problem does not appear to be unique to
Allreduce; it occurs with Reduce and Bcast as well, and maybe more.)
Initially I thought the maximum value for count would be 2^31-1
because count is an int. However, when using MPICH2 I already get a
segfault at count=2^31/8 (= 268435456), so I suspect that they transfer
bytes instead of doubles internally and that the byte count wraps
around at that value. This I can deal with: it is not nice, but I can
wrap all calls so that as soon as count > 268435456 the transfer is
split into several calls.
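A rough sketch of such a wrapper, under stated assumptions: the names
chunked_reduce, reduce_fn and copy_chunk are hypothetical and not part of
MPI. The callback stands in for the real collective, so the sketch compiles
without MPI; in practice it would wrap
MPI_Allreduce(s, r, n, MPI_DOUBLE, MPI_SUM, comm). Splitting is valid only
for element-wise operations such as MPI_SUM, and every rank would have to
use the same count and max_chunk so that the pieces match up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: processes n doubles from s into r.
 * With MPI it would call MPI_Allreduce on the chunk and return the
 * MPI error code (0 is assumed to mean success). */
typedef int (*reduce_fn)(const double *s, double *r, int n, void *ctx);

/* Split one large call into pieces of at most max_chunk elements.
 * Returns 0 on success, or the first nonzero code from the callback. */
int chunked_reduce(const double *sbuf, double *rbuf, size_t count,
                   int max_chunk, reduce_fn fn, void *ctx)
{
    assert(max_chunk > 0 && fn != NULL);
    size_t done = 0;
    while (done < count) {
        size_t left = count - done;
        /* The last piece may be shorter than max_chunk. */
        int n = left > (size_t)max_chunk ? max_chunk : (int)left;
        int rc = fn(sbuf + done, rbuf + done, n, ctx);
        if (rc != 0)
            return rc;
        done += (size_t)n;
    }
    return 0;
}

/* Stand-in callback for testing without MPI: copies the chunk and
 * counts the calls (ctx is assumed to point to an int counter). */
int copy_chunk(const double *s, double *r, int n, void *ctx)
{
    memcpy(r, s, (size_t)n * sizeof(double));
    ++*(int *)ctx;
    return 0;
}
```

With max_chunk set to 268435456, a call with count=180000000 would go
through in a single piece, so this only addresses the MPICH2 limit, not
the hang seen with openmpi.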
However, with openmpi I just cannot figure out what the largest
permitted value is: in most cases the MPI calls hang for
count > 176763240, but this is not completely reproducible. It
appears to depend on the history, i.e., on which other MPI routines
have been called before.
From looking at the code, as far as I understand it, the MPICH2 problem
should not appear with openmpi: the allreduce call is split up into
several calls anyway - see the loop
    for (phase = 0; phase < num_phases; phase ++) {
        ...
    }
in coll_tuned_allreduce.c. In fact that loop is executed just fine.
The "hang" occurs when ompi_coll_tuned_sendrecv is called
(line 839 of coll_tuned_allreduce.c). Here is the call of that function:
(gdb) s
ompi_coll_tuned_sendrecv_actual (sendbuf=0x2aab2d539410, scount=90000000,
sdatatype=0x602530, dest=1, stag=-12, recvbuf=0x2aab02694010,
rcount=90000000, rdatatype=0x602530, source=1, rtag=-12, comm=0x602730,
status=0x0) at coll_tuned_util.c:41
and the program just hangs as soon as ompi_request_wait_all (line 55 of
coll_tuned_util.c) is executed.
Any ideas how to fix this?
Cheers,
Martin
--
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                        phone: 778 782-4691
Simon Fraser University            fax:   778 782-4242
Burnaby, British Columbia          email: siegert_at_[hidden]
Canada  V5A 1S6