@Orvillain
Keep posting!
It's great to see C++ stuff here, I'm having fun looking at your work.
Here is my take on your concept!
This may give more to chew on if you want to explore common buffer optimizations.
Your example is really nice and clear; this one will be less so. I'm not trying to overshadow your work at all, I just thought you might be curious, so I'll leave it here.
/*
Power-of-2 sized ring buffer delay with cubic interpolation.
Uses fixed-point math for fractional addressing: the high 16 bits store the integer index,
the low 16 bits store the fractional position between samples. This makes wrapping branchless, so each read is very cheap on the CPU.
HOW TO USE:
1. Call setSize(minCapacitySamples) once to allocate the buffer.
- The actual allocation is rounded up to the next power of two.
- This function sets the maximum possible delay length.
2. Each new input sample, call push(x) to write it into the delay line.
3. To read a delayed sample, call readCubic(delaySamples).
- delaySamples can be any float value (integer or fractional) up to the buffer size.
For example: delaySamples = 4410.5f gives a 100 ms delay at 44.1 kHz.
Note: with 16 fractional bits in a 32-bit address, the buffer can hold at most 65536 samples.
*/
#pragma once
#include <vector>
#include <algorithm>
#include <cstdint>
#include <cmath>
struct RingDelay
{
// fixed point settings: 16 fractional bits
static constexpr int FP_BITS = 16;
static constexpr uint32_t FP_ONE = 1u << FP_BITS;
static constexpr uint32_t FP_FRAC_MASK = FP_ONE - 1u;
static constexpr float FP_INV = 1.0f / float(FP_ONE); // precomputed reciprocal
std::vector<float> buf;
int w = 0;
int mask = 0;
static inline int nextPow2(int x) noexcept
{
if (x <= 1) return 1;
--x;
x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16;
return x + 1;
}
inline void setSize(int minCapacitySamples)
{
if (minCapacitySamples <= 0) minCapacitySamples = 1;
const int n = nextPow2(minCapacitySamples);
buf.assign((size_t)n, 0.0f);
mask = n - 1;
w = 0;
}
inline void push(float x) noexcept
{
buf[(size_t)w] = x;
w = (w + 1) & mask;
}
// cubic interpolation between 4 samples
static inline float cubicInterp(float s0, float s1, float s2, float s3, float f) noexcept
{
float c0 = s1;
float c1 = 0.5f * (s2 - s0);
float c2 = 0.5f * (2.0f*s0 - 5.0f*s1 + 4.0f*s2 - s3);
float c3 = 0.5f * (3.0f*(s1 - s2) + s3 - s0);
float t = std::fmaf(c3, f, c2);
t = std::fmaf(t, f, c1);
return std::fmaf(t, f, c0);
}
// fractional read using fixed-point address (Q16.16)
inline float readCubic(float delaySamples) const noexcept
{
if (delaySamples < 0.0f) delaySamples = 0.0f;
// convert delay into fixed point
const uint32_t delayQ = (uint32_t)(delaySamples * float(FP_ONE) + 0.5f);
// make fixed-point write pointer
const uint32_t wQ = (uint32_t)w << FP_BITS;
// wrap both integer and fractional parts with one AND
const uint32_t sizeMaskQ = (uint32_t(mask) << FP_BITS) | FP_FRAC_MASK;
const uint32_t rpQ = (wQ - delayQ) & sizeMaskQ;
// integer and fractional parts
const int i1 = int(rpQ >> FP_BITS);
const float f = float(rpQ & FP_FRAC_MASK) * FP_INV;
// neighbours
const int i0 = (i1 - 1) & mask;
const int i2 = (i1 + 1) & mask;
const int i3 = (i1 + 2) & mask;
return cubicInterp(buf[(size_t)i0], buf[(size_t)i1], buf[(size_t)i2], buf[(size_t)i3], f);
}
inline int size() const noexcept { return mask + 1; }
inline void clear() noexcept
{
std::fill(buf.begin(), buf.end(), 0.0f);
w = 0;
}
};
A few notes on the changes:
1. Fixed-point addressing
- The delay time is converted to Q16.16 fixed point: the integer index lives in the high 16 bits and the fractional position in the low 16 bits of one 32-bit value.
(Fixed-point wins here because both parts come out with a shift and a mask, and the ring-buffer wrap folds into the same bitwise AND. A float hides the whole and fractional parts inside its binary encoding, so separating them takes extra work like floor() and a subtract.)
- Wrapping is done with a single bitmask: no floor or % calls.
2. Removed redundant mask
- The i1 index already comes out wrapped by sizeMaskQ, so no extra mask is needed there; only the neighbour indices i0, i2, i3 need one.
3. Fused multiply-add (FMA) cubic interpolation
- The polynomial is evaluated in Horner form (fewer multiplications and rounding steps), and each step maps onto fmaf. It's faster on CPUs that support FMA.
4. Inlined hot methods
- Small functions are marked inline to encourage the compiler to remove call overhead (definitions inside the struct are implicitly inline anyway; the keyword just makes the intent explicit).
5. Efficient power-of-two size calculation
- nextPow2 rounds up with branch-free shift-and-OR bit-twiddling. No loops.