On Jan 13, 2009, at 3:32 PM, kmuriki_at_[hidden] wrote:
>> With IB, there's also the issue of registered memory. Open MPI
>> v1.2.x defaults to copy in/copy out semantics (with pre-registered
>> memory) until the message reaches a certain size, and then it uses
>> a pipelined register/RDMA protocol. However, even with copy in/out
>> semantics of small messages, the resulting broadcast should still
>> be much faster than over gige.
>> Are you using the same buffer for the warmup bcast as the actual
>> bcast? You might try using "--mca mpi_leave_pinned 1" to see if
>> that helps as well (will likely only help with large messages).
> I'm using different buffers for warmup and actual bcast. I tried the
> mpi_leave_pinned 1, but did not see any difference in behaviour.
In this case, you likely won't see much of a difference --
mpi_leave_pinned will generally only be a boost for long messages that
use the same buffers repeatedly.
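
To illustrate why reusing a buffer matters here, below is a toy cost model, not Open MPI's actual implementation. The cost constants and the function names (`send_cost`, `total_cost`) are hypothetical and chosen only for illustration; the one assumption taken from the discussion is that leaving memory pinned lets a later send from the same buffer skip the registration step.

```python
# Toy model only: the numbers are made-up cost units, not measurements.
REGISTER_COST = 50.0   # hypothetical cost of registering (pinning) a buffer
TRANSFER_COST = 10.0   # hypothetical cost of moving the data itself


def send_cost(buffer_id, pinned, leave_pinned):
    """Cost of one large-message send from the buffer named buffer_id."""
    cost = TRANSFER_COST
    if buffer_id not in pinned:
        cost += REGISTER_COST          # buffer must be registered first
    if leave_pinned:
        pinned.add(buffer_id)          # registration is kept for later sends
    return cost


def total_cost(buffer_ids, leave_pinned):
    """Total cost of a sequence of sends, starting with nothing pinned."""
    pinned = set()
    return sum(send_cost(b, pinned, leave_pinned) for b in buffer_ids)


same_buffer = ["buf"] * 4                 # the same buffer used repeatedly
different_buffers = ["warmup", "actual"]  # separate warmup and actual buffers

print(total_cost(same_buffer, leave_pinned=False))       # 240.0
print(total_cost(same_buffer, leave_pinned=True))        # 90.0
print(total_cost(different_buffers, leave_pinned=False)) # 120.0
print(total_cost(different_buffers, leave_pinned=True))  # 120.0
```

In this model the setting only pays off in the repeated-buffer case; with a separate warmup buffer and actual buffer, both runs cost the same, which is consistent with seeing no difference.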
> Maybe whenever Open MPI defaults to copy in/copy out semantics on my
> cluster, it performs very slowly (slower than gige), but not when it uses
That would be quite surprising. I still think there's some kind of
startup overhead going on here.
>>> Surprisingly, just doing two consecutive 80K byte MPI_BCASTs
>>> runs very quickly (forget about warmup and actual broadcast),
>>> whereas a single 80K broadcast is slow. Not sure if I'm missing
>> There's also the startup time and synchronization issues. Remember
>> that although MPI_BCAST does not provide any synchronization
>> guarantees, it could well be that the 1st bcast effectively
>> synchronizes the processes and the 2nd one therefore runs much
>> faster (because individual processes won't need to spend much time
>> blocking waiting for messages because they're effectively operating
>> in lock step after the first bcast).
>> Benchmarking is a very tricky business; it can be extremely
>> difficult to precisely measure exactly what you want to measure.
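
The synchronization effect described above can be sketched with a small simulation. This is a deliberately simplified model, not how Open MPI implements MPI_BCAST: it assumes a flat broadcast from rank 0 in which every rank finishes one `transfer_time` after the later of its own entry and the root's entry. The function name `bcast` and all the times are hypothetical.

```python
def bcast(entry_times, transfer_time):
    """Simulate one broadcast rooted at rank 0.

    entry_times[i] is when rank i enters the call. Returns the time each
    rank leaves the call and the elapsed time each rank would measure.
    """
    root_entry = entry_times[0]
    # A rank cannot finish before the root has entered and sent the data.
    exit_times = [max(t, root_entry) + transfer_time for t in entry_times]
    elapsed = [done - start for start, done in zip(entry_times, exit_times)]
    return exit_times, elapsed


# Hypothetical startup skew: the root (rank 0) arrives last.
entry = [5.0, 0.0, 1.0, 2.0]

exit1, elapsed1 = bcast(entry, transfer_time=1.0)
# The second bcast starts where the first one left off on each rank.
exit2, elapsed2 = bcast(exit1, transfer_time=1.0)

print("first bcast, slowest rank: ", max(elapsed1))   # 6.0
print("second bcast, slowest rank:", max(elapsed2))   # 1.0
```

In this model the first broadcast appears six times slower than the second even though the transfer cost is identical; the extra time is purely the wait for the late root. Timing only a single broadcast therefore measures the startup skew as well as the broadcast itself.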
> My main effort here is not to benchmark my cluster but to resolve a
> user problem: he complained that his bcasts are running very slowly.
> I tried to recreate the situation with a simple Fortran program that
> just performs a bcast of a size similar to the one in his code. It
> was also very slow (slower than gige), so I started increasing and
> decreasing the size of the bcast and observed that it is slow only in
> the range 8K bytes to 100K bytes.
Can you send your modified test program (with a warmup send)?
What happens if you run a benchmark like the broadcast section of IMB
(the Intel MPI Benchmarks) over both TCP and IB?