Thanks for your reply...
My coll_sync_priority is set to 50. See the dump of ompi_info --param
coll sync below...
Does sticking barriers in hurt anything, or is it just cosmetic?
I'm fine with this solution...
$ ompi_info --param coll sync
MCA coll: parameter "coll" (current value: <none>, data source: default value)
          Default selection set of components for the coll framework (<none> means use all components that can be found)
MCA coll: parameter "coll_base_verbose" (current value: "0", data source: default value)
          Verbosity level for the coll framework (0 = no verbosity)
MCA coll: parameter "coll_sync_priority" (current value: "50", data source: default value)
          Priority of the sync coll component; only relevant if barrier_before or barrier_after is > 0
MCA coll: parameter "coll_sync_barrier_before" (current value: "1000", data source: default value)
          Do a synchronization before each Nth collective
MCA coll: parameter "coll_sync_barrier_after" (current value: "0", data source: default value)
          Do a synchronization after each Nth collective
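For anyone finding this thread later: these values can be overridden at run time without touching the code. A possible invocation (the application name, process count, and the value 100 are just placeholders):

  mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_before 100 -np 32 ./my_app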
Quoting "Ralph Castain" <rhc_at_[hidden]>:
> Yeah, that is "normal". It has to do with unexpected messages.
> When you have procs running at significantly different speeds, the
> various operations get far enough out of sync that the memory
> consumed by recvd messages not yet processed grows too large.
> Instead of sticking barriers into your code, you can have OMPI do an
> internal sync after every so many operations to avoid the problem.
> This is done by enabling the "sync" collective component, and then
> adjusting the number of operations between forced syncs.
> Do an "ompi_info --params coll sync" to see the options. Then set
> the coll_sync_priority to something like 100 and it should work for
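[Note: the same parameters can also be set through the environment before launching; the value 100 below is just the starting point Ralph suggests:

  export OMPI_MCA_coll_sync_priority=100
  export OMPI_MCA_coll_sync_barrier_before=100
]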
> On Nov 10, 2009, at 2:45 PM, Glembek Ondřej wrote:
>> I am using the MPI_Reduce operation on a 122880x400 matrix of doubles.
>> The parallel job runs on 32 machines, each having a different
>> processor in terms of speed, but the architecture and OS are the
>> same on all machines (x86_64). The task is a typical
>> map-and-reduce, i.e. each of the processes collects some data,
>> which is then summed (MPI_Reduce w. MPI_SUM).
>> Because the processors differ, each of the jobs reaches the
>> MPI_Reduce at a different time.
>> The *first problem* came when I called MPI_Reduce on the whole
>> matrix. The system ended up with an *MPI_ERR_OTHER error*, each time
>> on a different rank. I fixed this problem by chunking up the matrix
>> into 2048 submatrices and calling MPI_Reduce in a cycle.
>> However, a *second problem* arose: MPI_Reduce hangs... It
>> apparently gets stuck in some kind of deadlock or something like
>> that. It seems that if the processors are of similar speed, the
>> problem disappears; however, I cannot guarantee this condition all the
>> time. I managed to get rid of the problem (at least after a few
>> non-problematic iterations) by sticking an MPI_Barrier before the
>> MPI_Reduce line.
>> The questions are:
>> 1) Is this usual behavior?
>> 2) Is there some kind of timeout for MPI_Reduce?
>> 3) Why does MPI_Reduce die on a large amount of data if the system
>> has enough address space (64-bit compilation)?
>> Ondrej Glembek
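PS, for the archives: a minimal sketch (not Ondrej's actual code) of the chunked MPI_Reduce pattern described above, including the MPI_Barrier workaround; the buffer names and the data-filling step are placeholders:

  #include <mpi.h>
  #include <stdlib.h>

  #define ROWS   122880
  #define COLS   400
  #define CHUNKS 2048          /* 122880*400/2048 = 24000 doubles per call */

  int main(int argc, char **argv)
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double *local = calloc((size_t)ROWS * COLS, sizeof(double));
      double *sum   = (rank == 0) ? calloc((size_t)ROWS * COLS, sizeof(double)) : NULL;

      /* ... each process fills 'local' with its partial statistics ... */

      size_t chunk = ((size_t)ROWS * COLS) / CHUNKS;
      for (int i = 0; i < CHUNKS; i++) {
          MPI_Barrier(MPI_COMM_WORLD);          /* the workaround barrier */
          MPI_Reduce(local + i * chunk,
                     (rank == 0) ? sum + i * chunk : NULL,
                     (int)chunk, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      }

      free(local);
      if (rank == 0) free(sum);
      MPI_Finalize();
      return 0;
  }

With the coll_sync component enabled as Ralph describes, the explicit MPI_Barrier call should no longer be needed.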
Ondrej Glembek, PhD student E-mail: glembek_at_[hidden]
UPGM FIT VUT Brno, L226 Web: http://www.fit.vutbr.cz/~glembek
Bozetechova 2, 612 66 Phone: +420 54114-1292
Brno, Czech Republic Fax: +420 54114-1290
GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C