Ring Buffer design
-
@ustk most likely missing knowledge due to sparse documentation :)
Efficiency-wise it shouldn't make much of a difference; in the end the compiler will optimize this pretty well, and all the index types in SNEX are templated so their interpolation can be derived at compile time.
I haven't played around with fixed-point math at all so I can't say whether that's much faster than using floating points, but there are a few things in the optimization notes from @griffinboy that are redundant:
- inline is basically worthless in C++ by now. Every compiler will inline small functions like this even with the lowest optimization settings.
- there's already a `juce::nextPowerOfTwo()` which contains the same code without the if branch, which seems unnecessary.
- using `std::fmaf` for a simple task like `a * b + c` seems to complicate the code and make it slower. Here's the Godbolt output for this C++ code that compares it vs. its naive implementation:

```cpp
#include <cmath>

float raw(float a, float b, float c)     { return a * b + c; }
float wrapped(float a, float b, float c) { return std::fmaf(a, b, c); }

int main() { return 0; }
```
x64 assembly output for MSVC with -O3:

```asm
float raw(float,float,float) PROC            ; raw
        movss   DWORD PTR [rsp+24], xmm2
        movss   DWORD PTR [rsp+16], xmm1
        movss   DWORD PTR [rsp+8], xmm0
        movss   xmm0, DWORD PTR a$[rsp]
>>      mulss   xmm0, DWORD PTR b$[rsp]
>>      addss   xmm0, DWORD PTR c$[rsp]
        ret     0

float wrapped(float,float,float) PROC
        movss   DWORD PTR [rsp+24], xmm2
        movss   DWORD PTR [rsp+16], xmm1
        movss   DWORD PTR [rsp+8], xmm0
        sub     rsp, 40                      ; 00000028H
        movss   xmm2, DWORD PTR c$[rsp]
        movss   xmm1, DWORD PTR b$[rsp]
        movss   xmm0, DWORD PTR a$[rsp]
>>      call    QWORD PTR __imp_fmaf
        add     rsp, 40                      ; 00000028H
        ret     0
```
You don't need to be able to fully understand or write assembly to extract valuable information from that. The most obvious thing is that the second function has more lines, which means more time spent there. But it gets worse when you take a closer look: I've marked the relevant lines with `>>`. The first function boils down to two single instructions which are basically free on a modern CPU, while the wrapped function invokes a function call, including copying the values into call registers etc., which is an order of magnitude slower than the first example.

TLDR: Start with the implementation that is the easiest to understand / write / debug, then profile or put isolated pieces into Godbolt to see what can be improved.
EDIT: I've messed up the compiler flags somehow (`-O3` is a clang compiler flag; with MSVC it's `-Ox`). The fully optimized assembly looks like this:

```asm
float raw(float,float,float) PROC        ; raw
        mulss   xmm0, xmm1
        addss   xmm0, xmm2
        ret     0

float wrapped(float,float,float) PROC    ; wrapped
        jmp     fmaf
```

so it calls the inbuilt CPU instruction directly, which indeed seems quicker, but I would still expect that this is not making much of a difference.
-
Thanks for the rebuke! Looks like I need to do my homework
-
@griffinboy no worries, this is very advanced stuff, and with DSP it's always good to have performance in mind. But I really recommend spending some time with Godbolt after watching a 20-minute introduction video to assembly (just make sure to pick a specific CPU instruction set for this, as the syntax varies between CPU types - I'm pretty familiar with x64 assembly, which I studied extensively when writing the SNEX optimizer, but the NEON instruction set for Apple Silicon CPUs is basically gibberish to me).
It helped me a lot in getting rid of the anxiety of writing slow code - you can expect that the compilers are able to make these low-level optimizations pretty well, and seeing the actual instructions the C++ code spits out confirms that in most cases without having to resort to profiling.
-
Here's another example:
Wrapping is now done with a single bitmask, no floor or % calls.
So your assumption is that if you know the limit is a power of two, you can skip the expensive modulo call (which is the same as a division) and resort to bit fiddling (`a & (LIMIT - 1)`). This is theoretically true, but this example shows you that the compiler is as smart as you, so there's no need to do this yourself.

I admit I'm using this a lot in the HISE codebase too, and it's not per se "bad coding style" (as it still speeds up the debug builds), but you don't leave performance on the table by not doing so - in fact the `index::wrap` function just uses the modulo operator, and since you can give it a compile-time upper limit, it will benefit from this optimization for power-of-two buffer sizes too.

So the code example calls the default modulo operator, but one time with an unknown variable and once with a guaranteed power of two. The first function call results in:

```asm
idiv    r8d
```

which is bad (idiv is division, so slow). The second one results in:

```asm
dec     ecx
or      ecx, -32                             ; ffffffffffffffe0H
inc     ecx
```

which is more or less the bit-fiddling formula above (I wouldn't try to understand what it does exactly, but a simple look at the instruction names hints at some very lightweight instructions).
-
Thanks for all this, it's super interesting.
I'm afraid I've turned this thread into an optimisation discussion, my fault whoops
-
@griffinboy all the responsibility is mine, not even contributing... But hell I'm glad I've done this lol
learning a lot
-
Nice additions!
So the reason I started learning about bitwise operations is that eventually I'll want to run the same code (or at least port it over) to the DaisySeed platform, so performance is reasonably important. But I've not even really worried about trying to squeeze stuff into fixed-point math.
I'm definitely curious though!
@griffinboy no worries at all!
-
@ustk As I understand it, SNEX is a subset of JUCE and C++ and doesn't have the full capabilities. I have used it for some simple waveshaping, and I did also write a sample playback node in it, but I hit some snags that at the time I wasn't able to work around. But probably just a knowledge hole that hopefully by now I've filled!!
The last thing I wrote in it was a sample+hold effect.
-
@Christoph-Hart said in Ring Buffer design:
TLDR: Start with the implementation that is the easiest to understand / write / debug, then profile or put isolated pieces into Godbolt to see what can be improved.
This was my thought too - it isn't any good writing code if I come back to it a week later and can barely understand it. So I try to approach things in the easiest to understand way first, even if it is a bit knuckleheaded. You can always optimize the hell out of a for loop or array access, later on down the line.
-
@griffinboy said in Ring Buffer design:
I don't use cubic interpolators on my delay lines, I use FIR filters (using SIMD) using impulses which are equivalent to windowed sinc interpolation.
Hey thanks dude for this. I went off on a ChatGPT research mission, and I think I'm starting to understand it. I've got it implemented anyway, it seems to work.
Here's how I'm understanding it, again with a bit of help!!!
....
Sinc uses a band-limited approach to reconstruct the original continuous signal. But computing a fresh sinc for every fractional position would be super expensive. So we precompute lots of sinc kernels and whack 'em in a table.
The table has a bunch of rows - phases.
Each row contains taps - the weights that nearby samples get multiplied by. More taps give you a better frequency response, but cost more CPU.

Each row is then a windowed sinc, where 'k' is the tap's integer offset from the centre sample and 'f' is the fractional position. You take a window function like Hann or Blackman-Harris, and that gently fades the ends so that the truncated sinc doesn't pre-ring.
After building the row, normalize it so it sums to 1. This keeps the DC level stable.
Store all rows flat in a coeffs array, with some helpers to read the correct row - rowNearest and rowBlended.
Nearest gives you the closest phase row for the given fraction.
Blended gives the two neighbouring rows plus a blend factor, so you can crossfade between the two.

Then inside my RingDelay, my new 'getReadPos' function gives an integer index and a fractional part. This can be used with any of my interpolation methods (nearest, linear, cubic, lagrange3, and now sinc).
readSinc uses nearest. readSincBlended uses blended.
For each tap, lookup the neighbouring samples around i1, and multiply by the weights. Sum them up, and that gives the interpolated value.
So Cubic or Lagrange - they use 4 samples and look smooth, but they're not truly band-limited. They roll off highs in the passband, so things sound slightly dull, and stopband rejection is not great, so they can alias or cause whistle tones.
Windowed Sinc - approximates the ideal reconstruction. Much flatter passband so it keeps brightness. Much stronger stopband attenuation, so there is less aliasing. It is also linear phase, so transients don't get skewed. When you sweep the delay time, it stays clean and consistent. But it costs more CPU.
-
That's exactly it.
And the cool thing about using an FIR interpolator like this with multiple taps is that you can SIMD it - process 4 taps in one SIMD instruction and cut CPU significantly.
HISE has XSIMD included. You don't need to include it, you can just use it in a C++ node.
Oh, but this technique can add latency, so it's not good for certain uses where it's too fiddly to compensate latency in different places. But it's good for delay effects and for reading samples from a buffer at a new speed.
-
So yeah, what I've implemented in my ring delay are the following:
Read Nearest: uses 1 sample. It is the worst quality in terms of aliasing and modulation smoothness. But the lowest CPU usage.
Read Linear: Uses 2 samples. It has audible HF loss, the aliasing is poor, but the modulation is a level above nearest. CPU usage is still low.
Read Cubic: Uses 4 samples. Somewhat noticeable HF roll-off, aliasing performance is fair, and the modulation is surprisingly smooth. Medium CPU usage.
Read Lagrange3: Uses 4 samples. Better HF retention than cubic, but still rolls off. Aliasing rejection is okay, but does let some junk through. Smooth modulation. Medium CPU usage.
Read Sinc: Has 16 taps, which means it takes 16 input samples around the read point and does 16 FMAs to produce one output sample. The passband is very flat, no HF loss. Aliasing performance is excellent. Modulation is excellent. CPU is relatively high - but here on my machine it never spiked above 1%.
Read Sinc Blended: Also takes 16 input samples, but because we have a blend it performs 32 FMAs to produce one output sample. The passband is very flat, no HF loss. Aliasing performance is excellent. Modulation is excellent. CPU is relatively high - but here on my machine it never spiked above 2%.

It will be interesting when I get this stuff onto an embedded platform like the DaisySeed (ARM Cortex based), and then I can see how it all works out CPU-wise.
-
@griffinboy said in Ring Buffer design:
That's exactly it.
And the cool thing about using an FIR interpolator like this with multiple taps is that you can SIMD it - process 4 taps in one SIMD instruction and cut CPU significantly.
HISE has XSIMD included. You don't need to include it, you can just use it in a C++ node.
Oh, but this technique can add latency, so it's not good for certain uses where it's too fiddly to compensate latency in different places. But it's good for delay effects and for reading samples from a buffer at a new speed.

Indeed! Indeed! I haven't got to any of this yet. Just using basic C++ loops without any SIMD or vectorization.
I'm looking at this purely for chorus or modulation effects, delays, reverbs, and sample playback.
The reason I've implemented all of these is because I want to experiment with 80's, 90's, 2000's, and modern delay and reverb approaches, and I want to embrace the "shittiness" of certain approaches, as long as they add to or enhance the character.
Pretty archetypal example for me. I've got a £100 Boss RV-5 pedal here, and a £700 Meris MercuryX pedal. Both outstanding reverbs. The RV-5 is clearly less clever under the hood. You can hear the delays in the FDN every now and then, and it doesn't sound super high fidelity. But bloody hell is it cool!!! Way better than later Boss reverbs IMHO.
-
I've never used sinc / FIR interpolators in a reverb.
I'm not sure how big the benefit would be.
For delays and synthesis it's great though, when you need low aliasing and low ripple.
-
@griffinboy said in Ring Buffer design:
I've never used sinc / FIR interpolators in a reverb.
I'm not sure how big the benefit would be.
For delays and synthesis it's great though, when you need low aliasing and low ripple.

A typical approach to decorrelating reverbs is to mildly modulate the delay times in the FDN. But you can also do it on the input diffuser or the early reflections APF network too. So anything that affects delay time is ultimately going to make some kind of difference.
It'd be interesting to compare them all. Maybe that's a new thread in a few weeks or so!