Ring Buffer design
-
So I was reading about mirrored ring buffers. For a normal ring delay, we have a circular buffer of size N - a write pointer advances and wraps back around when it reaches the end, and a read pointer advances at a fixed offset behind the write pointer to set the delay amount. This gives you the basic 1-tap delay effect.
Fractional reads complicate this: sometimes the window of samples needed for interpolation crosses the end of the buffer. Because that window isn't contiguous in memory, every index in it has to be wrapped with a modulo, and that costs CPU on every read operation.
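To make that concrete, here's a minimal sketch of the plain version with a linearly interpolated fractional read (the class and names are just placeholders, not from any library) - note the wrap on every tap:

#include <cstddef>
#include <vector>

// Plain ring delay: one wrap (modulo) per tap of the interpolation window.
struct PlainRingDelay
{
    std::vector<float> buffer;
    size_t writePos = 0;

    explicit PlainRingDelay(size_t size) : buffer(size, 0.0f) {}

    // Write one sample, then read 'delay' samples behind the write pointer
    // with linear interpolation (assumes 1 <= delay < size - 1).
    float process(float in, float delay)
    {
        const size_t size = buffer.size();
        buffer[writePos] = in;

        const size_t delayInt = static_cast<size_t>(delay);
        const float frac = delay - static_cast<float>(delayInt);

        // Each tap needs its own wrap because the window may cross the end.
        const size_t i0 = (writePos + size - delayInt) % size;
        const size_t i1 = (writePos + size - delayInt - 1) % size;
        const float out = buffer[i0] + frac * (buffer[i1] - buffer[i0]);

        writePos = (writePos + 1) % size;
        return out;
    }
};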
The mirror trick:
Instead of storing [0 ... N-1] we allocate a buffer twice as large. Every time we write a sample into index w, we also write it into w + N. This gives us a mirrored copy of the entire signal.
Even though we allocate twice the memory, the real delay line is still only N samples long. When we read, we can always read a straight contiguous block, so the interpolator doesn't need to check the boundary or even do an & mask operation for every tap.
I've not yet implemented it, but I think I grok the theory behind it. It costs extra memory, but it can be much faster for multi-tap fractional interpolation, because you avoid modulo and branching in the inner loop.
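This is roughly how I understand the mirrored version would look (an untested sketch, names made up): the write goes into both halves, and the fractional read never has to wrap because the window is always contiguous:

#include <cstddef>
#include <vector>

// Mirrored ring delay: physical storage is 2 * N, every write lands in both
// halves, so any read window of up to N samples is contiguous in memory.
struct MirroredRingDelay
{
    size_t N;                    // logical length, assumed to be a power of two
    std::vector<float> buffer;   // 2 * N floats
    size_t writePos = 0;

    explicit MirroredRingDelay(size_t length) : N(length), buffer(2 * length, 0.0f) {}

    void write(float in)
    {
        buffer[writePos] = in;        // primary copy
        buffer[writePos + N] = in;    // mirrored copy
        writePos = (writePos + 1) & (N - 1);
    }

    // Linear-interpolated read, assumes 1 <= delay < N - 1.
    // Both taps sit next to each other, no wrap or mask needed here.
    float readFractional(float delay) const
    {
        const size_t delayInt = static_cast<size_t>(delay);
        const float frac = delay - static_cast<float>(delayInt);

        // Start from the upper half so the index never goes below zero.
        const size_t base = writePos + N - delayInt;
        return buffer[base] + frac * (buffer[base - 1] - buffer[base]);
    }
};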
I'm going to implement it as a constructor boolean I think, so that my RingDelay class is flexible.
-
@Orvillain said in Ring Buffer design:
because you avoid modulo and branching in the inner loop.
If your buffer size is a power of two (which it always can be, just round it up), then modulo boils down to a bitwise AND operation even at the lowest optimization settings. That makes the wrap essentially free, whereas doubling the buffer does have a cost, so I'm not 100% sure this is worth the hassle.
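Something like this is all you need (nextPowerOfTwo here is just an illustrative helper, not a library call):

#include <cstddef>

// Round a requested buffer length up to the next power of two,
// so the wrap can always be 'index & (size - 1)'.
static size_t nextPowerOfTwo(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

// e.g. asking for 48000 samples:
//   size_t size = nextPowerOfTwo(48000);  // 65536
//   size_t mask = size - 1;               // 0xFFFF
//   size_t wrapped = index & mask;        // same as index % size for unsigned index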
-
@Christoph-Hart Ahh interesting!
Even on embedded systems like ARM Cortex (I'm using the DaisySeed chip) ??
Also yes, I enforce a power-of-2 buffer size at all times.
-
Yep, if your buffer is a power of two you can use bitmasking.
The only case where I've used other strategies is when the interpolator is very wide. My FIR convolution interpolator does 64 reads every sample, so even bitmask wraps add up.
But in such situations I actually just hoist the wraps out of the loop. I don't do any checking at all inside it: I note down where the edges are, and if we're nowhere near an edge I run a separate vectorized loop that avoids any checks, as in the sketch below.
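Roughly like this (a simplified sketch; the real thing is SIMD but the shape is the same, and the names are placeholders):

#include <cstddef>

// Hoisting the wrap out of the inner loop: decide once per read whether the
// 'numTaps'-wide window touches the end of the buffer, and only pay for
// per-tap masking in that rare case.
float firRead(const float* buffer, size_t mask, const float* kernel,
              size_t numTaps, size_t start)
{
    const size_t size = mask + 1;   // buffer length, power of two
    float acc = 0.0f;

    if (start + numTaps <= size)
    {
        // Fast path: whole window is contiguous, no checks,
        // easy for the compiler to vectorize.
        for (size_t i = 0; i < numTaps; ++i)
            acc += kernel[i] * buffer[start + i];
    }
    else
    {
        // Slow path: window crosses the end, mask every tap.
        for (size_t i = 0; i < numTaps; ++i)
            acc += kernel[i] * buffer[(start + i) & mask];
    }

    return acc;
}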
-
@Orvillain that's such a basic optimization technique that I would expect any compiler made after 1975 to incorporate it.
If you don't trust that, you can easily do this yourself:
buffer[i % 4096]
buffer[i & 4095]
These are equivalent statements for (unsigned) integer indices: one costs ~88 CPU cycles, and one is basically free (~1 CPU cycle).
If you want a good read, this is the quasi-standard reference for comparing CPU instructions. You need to dig around a bit to find what you're after, and again, you're basically doing by hand what the compiler already does for you, so it's purely educational:
https://www.agner.org/optimize/instruction_tables.pdf
The assembly instruction for division (and modulo) is IDIV, which clocks in at 88 CPU cycles. AND takes a single CPU cycle.
-
@Christoph-Hart Thank you!