On Sun, 12 Aug 2007, Gleb Natapov wrote:
> > Any objections? We can discuss what approaches we want to take
> > (there's going to be some complications because of the PML driver,
> > etc.); perhaps in the Tuesday Mellanox teleconf...?
> My main objection is that the only reason you propose to do this is some
> bogus benchmark? Is there any other reason to implement header caching?
> I also hope you don't propose to break layering and somehow cache PML headers
> in BTL.
Gleb is hitting the main points I wanted to bring up. We examined
header caching in the context of PSM a little while ago. 0.5us is
much more than we observed -- at 3GHz, 0.5us would be about 1500
cycles of code that has very few branches. For us, with a much
bigger header and more fields to fetch from different structures, it
was more like 350 cycles, which is on the order of 0.1us and not
worth the effort (in code complexity, readability and, frankly,
motivation for performance). Maybe there's more to it than just
"code caching" -- like sending from pre-pinned headers or using RDMA
with immediate, etc. But I'd be surprised to find out that the
openib btl doesn't do the best thing here.
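
For anyone who wants to redo the cycle arithmetic above, here is a
small sketch. The helper names are made up for illustration and the
only input assumed is the 3GHz clock mentioned above; it converts
between nanoseconds and cycles, nothing more.

```c
#include <stdint.h>

/* Hypothetical helper: cycles elapsed in `ns` nanoseconds on a core
 * clocked at `mhz` MHz (integer math, so 500ns at 3000MHz is exact). */
uint64_t cycles_from_ns(uint64_t ns, uint64_t mhz)
{
    return ns * mhz / 1000u;
}

/* Hypothetical helper: nanoseconds needed to run `cycles` cycles
 * on a core clocked at `mhz` MHz. */
double ns_from_cycles(uint64_t cycles, uint64_t mhz)
{
    return (double)cycles * 1000.0 / (double)mhz;
}
```

With these, cycles_from_ns(500, 3000) gives the 1500 cycles quoted
above, and ns_from_cycles(350, 3000) comes out a little under 117ns,
i.e. on the order of 0.1us.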
I have pretty good evidence that for CM, the latency difference comes
from the receive side (in particular opal_progress). Doesn't the
openib btl receive side do something similar with opal_progress,
i.e. register a callback function? It probably does something
different, like checking a few RDMA mailboxes (or per-peer landing
pads), but anything that gets called before or after it as part of
opal_progress is a cause of slowdown.
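
To illustrate the point about callbacks, here is a minimal sketch of
a generic progress loop. This is not the actual opal_progress code;
every name in it (progress_register, progress_poll, the two toy
components) is hypothetical. It only shows the general pattern: each
pass calls every registered callback, so a component pays for
whatever else is registered alongside it, even when those callbacks
have nothing to do.

```c
#include <stddef.h>

#define MAX_CALLBACKS 8

/* Hypothetical callback type: returns the number of events completed. */
typedef int (*progress_cb_t)(void);

static progress_cb_t callbacks[MAX_CALLBACKS];
static size_t num_callbacks = 0;

/* Register a callback; returns 0 on success, -1 if the table is full. */
int progress_register(progress_cb_t cb)
{
    if (num_callbacks == MAX_CALLBACKS)
        return -1;
    callbacks[num_callbacks++] = cb;
    return 0;
}

/* One progress pass: every registered callback runs, idle or not. */
int progress_poll(void)
{
    int events = 0;
    for (size_t i = 0; i < num_callbacks; i++)
        events += callbacks[i]();
    return events;
}

/* Toy components standing in for transports (illustration only). */
int calls_made = 0;
int idle_component(void) { calls_made++; return 0; }
int busy_component(void) { calls_made++; return 1; }
```

If idle_component is registered ahead of busy_component, a single
progress_poll() still invokes both, so the one event that completes
is delivered only after the idle callback has run.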
. . christian
(QLogic Host Solutions Group, formerly Pathscale)