On Jan 12, 2009, at 2:50 PM, kmuriki_at_[hidden] wrote:
> Is there any requirement on the size of the data buffers
> I should use in these warmup broadcasts ? If I use small
> buffers like 1000 real values during warmup, the following
> actual and timed MPI_BCAST over IB is taking a lot of time
> (more than on GigE). If I use a bigger buffer of 10000 real
> values during warmup the following timed MPI_BCAST is quick.
I can't quite grok that -- "actual and timed MPI_BCAST"; are you
talking about 2 different bcasts?
With IB, there's also the issue of registered memory. Open MPI v1.2.x
defaults to copy in/copy out semantics (with pre-registered memory)
until the message reaches a certain size, and then it uses a pipelined
register/RDMA protocol. However, even with copy in/out semantics of
small messages, the resulting broadcast should still be much faster
than over GigE.
Are you using the same buffer for the warmup bcast as the actual
bcast? You might try using "--mca mpi_leave_pinned 1" to see if that
helps as well (will likely only help with large messages).
> Surprisingly, just doing two consecutive 80K byte MPI_BCASTs
> performs very quickly (forget about warmup and actual broadcast),
> whereas a single 80K broadcast is slow. Not sure if I'm missing
There's also the startup time and synchronization issues. Remember
that although MPI_BCAST does not provide any synchronization
guarantees, it could well be that the 1st bcast effectively
synchronizes the processes and the 2nd one therefore runs much faster
(because individual processes won't need to spend much time blocking
waiting for messages because they're effectively operating in lock
step after the first bcast).
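
To make the lock-step argument concrete, here is a toy model in plain Python (no MPI involved). Everything in it is an assumption for illustration only: the function name bcast_elapsed, the arrival times, the fixed transfer cost, and the simplification that a rank leaves the broadcast one transfer time after both it and the root have entered. It is not how Open MPI implements MPI_BCAST; it only shows how arrival skew can end up in the time measured around the first call.

```python
def bcast_elapsed(arrivals, transfer, root=0):
    """Toy model of one broadcast (hypothetical, not Open MPI's algorithm).

    arrivals: time (in microseconds) at which each rank enters the call.
    transfer: assumed fixed cost of moving the data, in microseconds.
    A rank cannot receive before the root has entered, so it exits at
    max(own arrival, root arrival) + transfer.
    Returns (time each rank spent inside the call, exit time of each rank).
    """
    root_arrival = arrivals[root]
    exits = [max(t, root_arrival) + transfer for t in arrivals]
    elapsed = [e - t for e, t in zip(exits, arrivals)]
    return elapsed, exits


# Assumed startup skew: the root (rank 0) enters last, at 900 us.
arrivals = [900, 0, 100, 250]

# First broadcast: early ranks sit blocked until the root shows up.
first, exits = bcast_elapsed(arrivals, transfer=50)

# Second broadcast: every rank enters when it left the first one.
second, _ = bcast_elapsed(exits, transfer=50)

print("first bcast, per-rank time: ", first)
print("second bcast, per-rank time:", second)
```

In this model the ranks that arrived early charge their waiting time to the first broadcast, while the second broadcast costs every rank only the assumed transfer time, because all ranks left the first one together.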
Benchmarking is a very tricky business; it can be extremely difficult
to precisely measure exactly what you want to measure.