Please read below:
> On Jan 12, 2009, at 2:50 PM, kmuriki_at_[hidden] wrote:
>> Is there any requirement on the size of the data buffers
>> I should use in these warmup broadcasts ? If I use small
>> buffers like 1000 real values during warmup, the following
>> actual and timed MPI_BCAST over IB is taking a lot of time
>> (more than that on GigE). If I use a bigger buffer of 10000 real
>> values during warmup the following timed MPI_BCAST is quick.
> I can't quite grok that -- "actual and timed MPI_BCAST"; are you talking
> about 2 different bcasts?
No, I meant the same bcast when I said "actual" and "timed".
This is the main bcast in the program, which I have timed. Before
this bcast, as you suggested, I did one warmup bcast, and in each
attempt I varied the size of the warmup bcast from 1000 to
10000 real values.
> With IB, there's also the issue of registered memory. Open MPI v1.2.x
> defaults to copy in/copy out semantics (with pre-registered memory) until the
> message reaches a certain size, and then it uses a pipelined register/RDMA
> protocol. However, even with copy in/out semantics of small messages, the
> resulting broadcast should still be much faster than over gige.
> Are you using the same buffer for the warmup bcast as the actual bcast? You
> might try using "--mca mpi_leave_pinned 1" to see if that helps as well (will
> likely only help with large messages).
I'm using different buffers for the warmup and the actual bcast. I tried
mpi_leave_pinned 1, but did not see any difference in behaviour.
Maybe whenever Open MPI defaults to copy in/copy out semantics on my
cluster it performs very slowly (slower than GigE), but not when it uses RDMA.
Any tips on how to debug this?
>> Surprisingly, just doing two consecutive 80K byte MPI_BCASTs
>> performs very quickly (forget about warmup and actual broadcast),
>> whereas a single 80K broadcast is slow. Not sure if I'm missing
> There's also the startup time and synchronization issues. Remember that
> although MPI_BCAST does not provide any synchronization guarantees, it could
> well be that the 1st bcast effectively synchronizes the processes and the 2nd
> one therefore runs much faster (because individual processes won't need to
> spend much time blocking waiting for messages because they're effectively
> operating in lock step after the first bcast).
> Benchmarking is a very tricky business; it can be extremely difficult to
> precisely measure exactly what you want to measure.
My main goal here is not to benchmark my cluster but to resolve a
user problem: he complained that his bcasts are running very slowly. I
tried to recreate the situation with a simple Fortran program
that just performs a bcast of a size similar to the one in his code. It
also performed very slowly (slower than GigE). I then started increasing
and decreasing the size of the bcast and observed that it is slow only
in the range 8K bytes to 100K bytes.
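
For reference, here is a minimal sketch of the kind of test program described above. It is not the user's actual code. The program name, the buffer length n, and the choice of root rank 0 are illustrative assumptions, and the byte count printed assumes the default REAL is 4 bytes. It places an MPI_BARRIER before the timed MPI_BCAST so that start-up skew between processes is less likely to be charged to the broadcast, and it reports the slowest rank's time.

```fortran
program bcast_timing
  implicit none
  include 'mpif.h'

  ! Assumption: default REAL is 4 bytes, so n = 20000 is about 80000 bytes,
  ! which falls inside the 8K-100K byte range that was observed to be slow.
  integer, parameter :: n = 20000
  integer :: ierr, rank
  real :: buf(n)
  double precision :: t0, t1, dt, dtmax

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Root holds the data to send; the other ranks start with zeros.
  buf = 0.0
  if (rank == 0) buf = 1.0

  ! Line the processes up first, so that waiting for late starters
  ! is not counted as broadcast time.
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)

  t0 = MPI_WTIME()
  call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
  t1 = MPI_WTIME()
  dt = t1 - t0

  ! The broadcast is only finished when the slowest rank has its data,
  ! so report the maximum elapsed time over all ranks.
  call MPI_REDUCE(dt, dtmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, &
                  MPI_COMM_WORLD, ierr)

  if (rank == 0) then
     print *, 'approx. bytes:', 4*n, '  max bcast time (s):', dtmax
  end if

  call MPI_FINALIZE(ierr)
end program bcast_timing
```

Changing n and rerunning over both interconnects, with and without "--mca mpi_leave_pinned 1", should show whether the slowdown really is confined to one message-size range.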