HISE Logo Forum
    • Categories
    • Register
    • Login

    Ring Buffer design

    Scheduled Pinned Locked Moved C++ Development
    20 Posts 4 Posters 506 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • OrvillainO
      Orvillain
      last edited by Orvillain

      Now I'm not too proud to say that I had to use ChatGPT to teach me this. But I actually fully understand it now, and it is really cool. I've studied and commented it myself to make sure I fully grok it.

      static inline float cubic4(float s0, float s1, float s2, float s3, float f)
      {
          float a0 = -0.5f * s0 + 1.5f * s1 - 1.5f * s2 + 0.5f * s3;
          float a1 =  s0        - 2.5f * s1 + 2.0f * s2 - 0.5f * s3;
          float a2 = -0.5f * s0 + 0.5f * s2;
          float a3 =  s1;
          return ((a0 * f + a1) * f + a2) * f + a3;
      }
      
      struct RingDelay
      {
      	// holds all of the sample data
      	std::vector<float> buf;
      
      	// write position
      	int w = 0;
      
      	// bitmask for fast wrap-around (which is why the buffer must always be power-of-2)
      	int mask = 0;
      
      	// sets the size of the buffer according to requested size in samples
      	// will set the buffer to a power-of-two size above the requested capacity
      	// for example - minCapacitySamples==3000, n==4096
      	void setSize(int minCapacitySamples)
      	{
      		// start off with n=1
      		int n = 1;
      
      		// keep doubling n until it is greater or equal to minCapacitySamples
      		while (n < minCapacitySamples)
      		{
      			n <<= 1;
      		}
      
      		// set the size of the buffer to n, and fill with zeros
      		buf.assign(n, 0.0f);
      
      		// mask is now n-1; 4095 in the example
      		mask = n - 1;
      		
      		// reset the write pointer to zero
      		w = 0;
      	}
      
      	// push a sample value into the buffer at the write position
      	// this will always wrap around the capacity length because of the mask value being set prior
      	void push(float x)
      	{
      		buf[w] = x; // set the current write position to the sample value
      		w = (w + 1) & mask; // increment w by 1. Use bitwise 'AND' operator to perform a wrap
      	}
      
      	// Performs a cubic interpolation read operation on the buffer at the specified sample position
      	// This can be a fractional number
      	float readCubic(float delaySamples) const
      	{
      		// w is the next write position. Read back from that according to delaySamples.
      		float rp = static_cast<float>(w) - delaySamples;
      		// wrap this read pointer into the range 0-size, where size=mask+1
      		rp -= std::floor(rp / static_cast<float>(mask + 1)) * static_cast<float>(mask + 1);
      
      		// the floor of rp - the integer part
      		int i1 = static_cast<int>(rp);
      
      		// the decimal part
      		float f = rp - static_cast<float>(i1);
      
      		// grab the neighbours around i1
      		int i0 = (i1 - 1) & mask;
      		int i2 = (i1 + 1) & mask;
      		int i3 = (i1 + 2) & mask;
      
      		// feed those numbers into the cubic interpolator
      		return cubic4(buf[i0], buf[i1 & mask], buf[i2], buf[i3], f);
      	}
      
      	// returns the size of the buffer
      	int size() const
      	{
      		return mask + 1;
      	}
      
      	// clear the buffer without changing the size and reset the write pointer
      	void clear()
      	{
      		std::fill(buf.begin(), buf.end(), 0.0f);
      		w = 0;
      	}
      };
      
      
      

      You'd call setSize inside the prepare method. And then in processFrame you'd push your sample-value, and the whole thing would just work in a ring, without having to manage any incrementing of pointers inside the processFrame method yourself.

      I'd always heard devs talk about ring buffers and ring delays, and now I get it.

      I don't fully understand the cubic function beyond something like "take an average of all 4 values and return it" - but it does work... and this can be used to build delays that smoothly interpolate when changing the delay time.

      Musician - Instrument Designer - Sonic Architect - Creative Product Owner
      Crafting sound at every level. From strings to signal paths, samples to systems.

      1 Reply Last reply Reply Quote 1
      • OrvillainO
        Orvillain
        last edited by Orvillain

        Here is a super contrived example. If you compile this as a node, and add it in your scriptnode layout... it will delay the right signal by 2 seconds.

        #pragma once
        #include <JuceHeader.h>
        
        namespace project
        {
        using namespace juce;
        using namespace hise;
        using namespace scriptnode;
        
        static inline float cubic4(float s0, float s1, float s2, float s3, float f)
        {
            float a0 = -0.5f * s0 + 1.5f * s1 - 1.5f * s2 + 0.5f * s3;
            float a1 =  s0        - 2.5f * s1 + 2.0f * s2 - 0.5f * s3;
            float a2 = -0.5f * s0 + 0.5f * s2;
            float a3 =  s1;
            return ((a0 * f + a1) * f + a2) * f + a3;
        }
        
        struct RingDelay
        {
        	// holds all of the sample data
        	std::vector<float> buf;
        
        	// write position
        	int w = 0;
        
        	// bitmask for fast wrap-around (which is why the buffer must always be power-of-2)
        	int mask = 0;
        
        	// sets the size of the buffer according to requested size in samples
        	// will set the buffer to a power-of-two size above the requested capacity
        	// for example - minCapacitySamples==3000, n==4096
        	void setSize(int minCapacitySamples)
        	{
        		// start off with n=1
        		int n = 1;
        
        		// keep doubling n until it is greater or equal to minCapacitySamples
        		while (n < minCapacitySamples)
        		{
        			n <<= 1;
        		}
        
        		// set the size of the buffer to n, and fill with zeros
        		buf.assign(n, 0.0f);
        
        		// mask is now n-1; 4095 in the example
        		mask = n - 1;
        		
        		// reset the write pointer to zero
        		w = 0;
        	}
        
        	// push a sample value into the buffer at the write position
        	// this will always wrap around the capacity length because of the mask value being set prior
        	void push(float x)
        	{
        		buf[w] = x; // set the current write position to the sample value
        		w = (w + 1) & mask; // increment w by 1. Use bitwise 'AND' operator to perform a wrap
        	}
        
        	// Performs a cubic interpolation read operation on the buffer at the specified sample position
        	// This can be a fractional number
        	float readCubic(float delaySamples) const
        	{
        		// w is the next write position. Read back from that according to delaySamples.
        		float rp = static_cast<float>(w) - delaySamples;
        		// wrap this read pointer into the range 0-size, where size=mask+1
        		rp -= std::floor(rp / static_cast<float>(mask + 1)) * static_cast<float>(mask + 1);
        
        		// the floor of rp - the integer part
        		int i1 = static_cast<int>(rp);
        
        		// the decimal part
        		float f = rp - static_cast<float>(i1);
        
        		// grab the neighbours around i1
        		int i0 = (i1 - 1) & mask;
        		int i2 = (i1 + 1) & mask;
        		int i3 = (i1 + 2) & mask;
        
        		// feed those numbers into the cubic interpolator
        		return cubic4(buf[i0], buf[i1 & mask], buf[i2], buf[i3], f);
        	}
        
        	// returns the size of the buffer
        	int size() const
        	{
        		return mask + 1;
        	}
        
        	// clear the buffer without changing the size and reset the write pointer
        	void clear()
        	{
        		std::fill(buf.begin(), buf.end(), 0.0f);
        		w = 0;
        	}
        };
        
        
        
        // ==========================| The node class with all required callbacks |==========================
        
        template <int NV> struct RingBufferExp: public data::base
        {
        	// Metadata Definitions ------------------------------------------------------------------------
        	
        	SNEX_NODE(RingBufferExp);
        	
        	struct MetadataClass
        	{
        		SN_NODE_ID("RingBufferExp");
        	};
        	
        	// set to true if you want this node to have a modulation dragger
        	static constexpr bool isModNode() { return false; };
        	static constexpr bool isPolyphonic() { return NV > 1; };
        	// set to true if your node produces a tail
        	static constexpr bool hasTail() { return false; };
        	// set to true if your doesn't generate sound from silence and can be suspended when the input signal is silent
        	static constexpr bool isSuspendedOnSilence() { return false; };
        	// Undefine this method if you want a dynamic channel count
        	static constexpr int getFixChannelAmount() { return 2; };
        	
        	// Define the amount and types of external data slots you want to use
        	static constexpr int NumTables = 0;
        	static constexpr int NumSliderPacks = 0;
        	static constexpr int NumAudioFiles = 0;
        	static constexpr int NumFilters = 0;
        	static constexpr int NumDisplayBuffers = 0;
        	
        	// components
        	double sampleRate = 48000.0;
        	RingDelay rd;
        	
        	// Helpers
        	
        	// Converts milliseconds to samples and returns as an integer
        	static inline int msToSamplesInt(float ms, double fs)
        	{
        		return (int)std::ceil(ms * fs / 1000.0);
        	}
        
        	// Converts milliseconds to samples and returns as a float
        	static inline float msToSamplesFloat(float ms, double fs)
        	{
        		return (float)(ms * fs / 1000.0);
        	}
        	
        	// Scriptnode Callbacks ------------------------------------------------------------------------
        	
        	void prepare(PrepareSpecs specs)
        	{
        		// update the sampleRate constant to be the current sample rate
        		sampleRate = specs.sampleRate;
        		
        		// we arbitrarily invent a pad guard number to add to the length of the ring delay
        		const int guard = 128;
        		
        		// set the size using msToSamplesInt because setSize expects an integer
        		rd.setSize(msToSamplesInt(10000.0f, sampleRate) + guard);
        	}
        	
        	void reset()
        	{
        		
        	}
        	
        	void handleHiseEvent(HiseEvent& e)
        	{
        		
        	}
        	
        	template <typename T> void process(T& data)
        	{
        		
        		static constexpr int NumChannels = getFixChannelAmount();
        		// Cast the dynamic channel data to a fixed channel amount
        		auto& fixData = data.template as<ProcessData<NumChannels>>();
        		
        		// Create a FrameProcessor object
        		auto fd = fixData.toFrameData();
        		
        		while(fd.next())
        		{
        			// Forward to frame processing
        			processFrame(fd.toSpan());
        		}
        		
        	}
        	
        	template <typename T> void processFrame(T& data)
        	{
        		// Separate out the stereo input
        		float L = data[0];
        		float R = data[1];
        		
        		// Calculate the delay time we want in samples - expects float
        		float dTime = msToSamplesFloat(2000.0f, sampleRate);
        		
        		// read the value according to the delay time we just setup
        		float dR = rd.readCubic(dTime);
        		
        		// Push just the right channel into our ring delay
        		// remember, this will auto-increment the write pointer
        		rd.push(R);
        		
        		// left channel - Write the original audio back to the datastream
        		data[0] = L;
        		// Right channel - Write the delayed audio back to the datastream
        		data[1] = dR;
        	}
        	
        	int handleModulation(double& value)
        	{
        		
        		return 0;
        		
        	}
        	
        	void setExternalData(const ExternalData& data, int index)
        	{
        		
        	}
        	// Parameter Functions -------------------------------------------------------------------------
        	
        	template <int P> void setParameter(double v)
        	{
        		if (P == 0)
        		{
        			// This will be executed for MyParameter (see below)
        			jassertfalse;
        		}
        		
        	}
        	
        	void createParameters(ParameterDataList& data)
        	{
        		{
        			// Create a parameter like this
        			parameter::data p("MyParameter", { 0.0, 1.0 });
        			// The template parameter (<0>) will be forwarded to setParameter<P>()
        			registerCallback<0>(p);
        			p.setDefaultValue(0.5);
        			data.add(std::move(p));
        		}
        	}
        };
        }
        
        
        
        

        Musician - Instrument Designer - Sonic Architect - Creative Product Owner
        Crafting sound at every level. From strings to signal paths, samples to systems.

        griffinboyG 1 Reply Last reply Reply Quote 2
        • griffinboyG
          griffinboy @Orvillain
          last edited by griffinboy

          @Orvillain

          Keep posting!
          It's great to see c++ stuff here, I'm having fun looking at your work.

          Here is my take on your concept!
          This may give more to chew on if you want to explore common buffer optimizations.
          Your example is really nice and clear: this will be less so. I'm not trying to overshadow any of your stuff at all, I just wondered if you'd be curious. I'll leave it here for the curious.

          /*
              Power-of-2 sized ring buffer delay with cubic interpolation.
              Uses fixed-point math for fractional addressing: the high 16 bits store the integer index,
              the low 16 bits store the fractional position between samples. This makes wrapping branchless, so each read is very cheap on the CPU.
          
              HOW TO USE:
                1. Call setSize(minCapacitySamples) once to allocate the buffer.
                   - The actual buffer allocated will be to the next power-of-two.
                   - This function sets the maximum possible delay length.
                2. Each new input sample, call push(x) to write it into the delay line.
                3. To read a delayed sample, call readCubic(delaySamples).
                   - delaySamples can be any float value (integer or fractional) up to the buffer size.
                   For example: delaySamples = 4410.5f gives a 100 ms delay at 44.1 kHz 
          */
          
          
          #pragma once
          #include <vector>
          #include <algorithm>
          #include <cstdint>
          #include <cmath>
          
          struct RingDelay
          {
              // fixed point settings: 16 fractional bits
              static constexpr int FP_BITS = 16;
              static constexpr uint32_t FP_ONE = 1u << FP_BITS;
              static constexpr uint32_t FP_FRAC_MASK = FP_ONE - 1u;
              static constexpr float FP_INV = 1.0f / float(FP_ONE); // precomputed reciprocal
          
              std::vector<float> buf;
              int w = 0;
              int mask = 0;
          
              static inline int nextPow2(int x) noexcept
              {
                  if (x <= 1) return 1;
                  --x;
                  x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16;
                  return x + 1;
              }
          
              inline void setSize(int minCapacitySamples)
              {
                  if (minCapacitySamples <= 0) minCapacitySamples = 1;
                  const int n = nextPow2(minCapacitySamples);
                  buf.assign((size_t)n, 0.0f);
                  mask = n - 1;
                  w = 0;
              }
          
              inline void push(float x) noexcept
              {
                  buf[(size_t)w] = x;
                  w = (w + 1) & mask;
              }
          
              // cubic interpolation between 4 samples
              static inline float cubicInterp(float s0, float s1, float s2, float s3, float f) noexcept
              {
                  float c0 = s1;
                  float c1 = 0.5f * (s2 - s0);
                  float c2 = 0.5f * (2.0f*s0 - 5.0f*s1 + 4.0f*s2 - s3);
                  float c3 = 0.5f * (3.0f*(s1 - s2) + s3 - s0);
                  float t  = std::fmaf(c3, f, c2);
                  t        = std::fmaf(t,  f, c1);
                  return std::fmaf(t,  f, c0);
              }
          
              // fractional read using fixed-point address (Q16.16)
              inline float readCubic(float delaySamples) const noexcept
              {
                  if (delaySamples < 0.0f) delaySamples = 0.0f;
          
                  // convert delay into fixed point
                  const uint32_t delayQ = (uint32_t)(delaySamples * float(FP_ONE) + 0.5f);
          
                  // make fixed-point write pointer
                  const uint32_t wQ = (uint32_t)w << FP_BITS;
          
                  // wrap both integer and fractional parts with one AND
                  const uint32_t sizeMaskQ = (uint32_t(mask) << FP_BITS) | FP_FRAC_MASK;
                  const uint32_t rpQ = (wQ - delayQ) & sizeMaskQ;
          
                  // integer and fractional parts 
                  const int   i1 = int(rpQ >> FP_BITS);
                  const float f  = float(rpQ & FP_FRAC_MASK) * FP_INV;
          
                  // neighbours
                  const int i0 = (i1 - 1) & mask;
                  const int i2 = (i1 + 1) & mask;
                  const int i3 = (i1 + 2) & mask;
          
                  return cubicInterp(buf[(size_t)i0], buf[(size_t)i1], buf[(size_t)i2], buf[(size_t)i3], f);
              }
          
              inline int size() const noexcept { return mask + 1; }
          
              inline void clear() noexcept
              {
                  std::fill(buf.begin(), buf.end(), 0.0f);
                  w = 0;
              }
          };
          
          
          
          1. Fixed-point addressing 
             - Delay time is split into fixed point math. 
          We store the integer + fractional part as a 32-bit value. 
          (Fixed-point is better because it stores the whole and fractional 
          parts together in one integer, so the CPU can grab them with 
          simple bit-ops instead of doing slower math. With a float, the 
          whole part and fractional part are hidden inside its binary 
          format, so you need extra math (like floor and division) to separate them)
             - Wrapping is now done with a single bitmask, no floor or % calls.
          
          2. Removed redundant mask
             - i1 index already wraps by sizeMaskQ. No extra mask is needed.
          
          3. Fused multiply-add (FMA) cubic interpolation
             - We use the Horner form here (fewer multiplications and rounding steps) It's
           faster on CPUs that support FMA.
          
          4. Inlined hot methods
             - We mark small functions as inline. This removes function call overhead.
          
          5. Efficient power-of-two size calculation
             - nextPow2 uses bit-twiddling. No loops
          
          
          ustkU 1 Reply Last reply Reply Quote 1
          • ustkU
            ustk @griffinboy
            last edited by

            @griffinboy @Orvillain Is there a particular reason not to use the SNEX ready made interpolators and fractional index reading? Efficiency maybe?

            Hise made me an F5 dude, browser just suffers...

            griffinboyG Christoph HartC OrvillainO 3 Replies Last reply Reply Quote 0
            • griffinboyG
              griffinboy @ustk
              last edited by griffinboy

              @ustk

              absolutely efficiency. The built in ones aren't bad by any means, but they are generic rather than made specific for your needs only. And thus they miss out on a few optimizations.

              But apart from that they are also not amazing.
              I don't use cubic interpolators on my delay lines, I use FIR filters (using SIMD) using impulses which are equivalent to windowed sync interpolation. But for this example I left in the cubic. Else it would be multiple headers worth of scripts as we start to worry about upsampling downsampling, convolution.
              Which is perhaps overkill by the way, I partly do it because I can. Not because anyone cares.

              OrvillainO 1 Reply Last reply Reply Quote 1
              • Christoph HartC
                Christoph Hart @ustk
                last edited by Christoph Hart

                @ustk most likely missing knowledge due to sparse documentation :)

                Efficiency wise it shouldn't make much of a difference, in the end the compiler will optimize this pretty well and all the index types in SNEX are templated so their interpolation can be derived on compile time.

                I haven't played around with fixed-point math at all so I can't say whether that's much faster than using floating points, but there are a few things in the optimization notes from @griffinboy that are redundant:

                • inline is basically worthless in C++ by now. Every compiler will inline small functions like this even with the lowest optimization settings.
                • there's already a juce::nextPowerOfTwo() which contains the same code without the if branch which seems unnecessary.
                • using std::fmaf for a simple task like a*b + c seems to complicate the code and make it slower. Here's the godbolt output for this C++ code that compares it vs. its naive implementation:
                #include <cmath>
                
                
                float raw(float a, float b, float c)
                {
                    return a * b + c;
                }
                
                float wrapped(float a, float b, float c)
                {
                    return std::fmaf(a, b, c);
                }
                
                int main()
                {
                    return 0;
                }
                

                x64 assembly output for MSVC with -O3:

                float raw(float,float,float) PROC                           ; raw
                        movss   DWORD PTR [rsp+24], xmm2
                        movss   DWORD PTR [rsp+16], xmm1
                        movss   DWORD PTR [rsp+8], xmm0
                        movss   xmm0, DWORD PTR a$[rsp]
                >>      mulss   xmm0, DWORD PTR b$[rsp]
                >>      addss   xmm0, DWORD PTR c$[rsp]
                        ret     0
                
                float wrapped(float,float,float) PROC                    
                        movss   DWORD PTR [rsp+24], xmm2
                        movss   DWORD PTR [rsp+16], xmm1
                        movss   DWORD PTR [rsp+8], xmm0
                        sub     rsp, 40                             ; 00000028H
                        movss   xmm2, DWORD PTR c$[rsp]
                        movss   xmm1, DWORD PTR b$[rsp]
                        movss   xmm0, DWORD PTR a$[rsp]
                >>      call    QWORD PTR __imp_fmaf
                        add     rsp, 40                             ; 00000028H
                        ret     0
                

                You don't need to be able to fully understand or write assembly to extract valuable information out of that, the most obvious thing is more lines in the second function which means more time spent there. But it gets worse when you take a closer look: I've marked the relevant lines with >>. The first function boils down to two single instructions which are basically for free on a modern CPU while the wrapped function invokes a function call including copying the values into call registers etc. which is a order of magnitude slower than the first example.

                TLDR: Start with the implementation that is the easiest to understand / write / debug, then profile or put isolated pieces into Godbolt to see what can be improved.

                EDIT: I've messed up the compiler flags somehow (-O3 is a clang compiler flag, with MSVC it's -Ox, the fully optimized assembly looks like this:

                float raw(float,float,float) PROC                           ; raw
                        mulss   xmm0, xmm1
                        addss   xmm0, xmm2
                        ret     0
                
                float wrapped(float,float,float) PROC                       ; wrapped
                        jmp     fmaf
                

                so it calls the inbuilt CPU instruction directly which indeed seems quicker, but I would still expect that this is not making much of a difference.

                griffinboyG OrvillainO 2 Replies Last reply Reply Quote 4
                • griffinboyG
                  griffinboy @Christoph Hart
                  last edited by

                  @Christoph-Hart

                  Thanks for the rebuke! Looks like I need to do my homework

                  Christoph HartC 1 Reply Last reply Reply Quote 0
                  • Christoph HartC
                    Christoph Hart @griffinboy
                    last edited by

                    @griffinboy no worries, this is very advanced stuff, and with DSP it's always good to have performance in mind but I really recommend to spend some time with godbolt after watching a 20 minute introduction video to assembly (just make sure to pick a certain CPU instruction set for this as the syntax varies between CPU types - I'm pretty familiar with x64 assembly which I've studied extensively when writing the SNEX optimizer, but the NEON instruction set for Apple Silicon CPUs is basically gibberish for me).

                    It helped me a lot getting rid of the anxiety of writing slow code - you can expect that the compilers are able to make these low-level type optimizations pretty well and seeing the actual instructions the C++ code spits out does confirm that in most cases without having to resort to profiling.

                    Christoph HartC 1 Reply Last reply Reply Quote 2
                    • Christoph HartC
                      Christoph Hart @Christoph Hart
                      last edited by Christoph Hart

                      Here's another example:

                      Wrapping is now done with a single bitmask, no floor or % calls.

                      So your assumption is that if you know that the limit is a power of two you can skip the expensive modulo call (which is the same as a division) and resort to bit fiddling (a & (LIMIT - 1). This is theoretically true, but this example shows you that the compiler is as smart as you so there's no need to do this yourself

                      I admit I'm using this too a lot in the HISE codebase and it's not per se "bad coding style" (as it still speeds up the debug builds), but you don't leave performance on the table by not doing so - in fact the index::wrap function just uses the modulo operator and since you can give it a compile time upper limit it will benefit from this optimization for power-of two buffer sizes too.

                      So the code example calls the default modulo operator, but one time with a unknown variable and once with a guaranteed power of two. The first function call results in:

                      idiv    r8d
                      

                      which is bad (idiv is division so slow). The second one results in:

                      dec     ecx
                      or      ecx, -32                            ; ffffffffffffffe0H
                      inc     ecx
                      

                      which is more or less the bit fiddling formula above (I wouldn't try to understand what it does exactly, but a simple look at the instruction names hint at some very lightweight instructions).

                      griffinboyG 1 Reply Last reply Reply Quote 3
                      • griffinboyG
                        griffinboy @Christoph Hart
                        last edited by

                        @Christoph-Hart

                        Thanks for all this, it's super interesting.

                        I'm afraid I've turned this thread into an optimisation discussion, my fault whoops

                        ustkU 1 Reply Last reply Reply Quote 1
                        • ustkU
                          ustk @griffinboy
                          last edited by

                          @griffinboy all the responsibility is mine, not even contributing... But hell I'm glad I've done this lol 😆 learning a lot 👍

                          Hise made me an F5 dude, browser just suffers...

                          1 Reply Last reply Reply Quote 0
                          • OrvillainO
                            Orvillain
                            last edited by

                            Nice additions!

                            So the reason I started learning about bitwise operations is because eventually I'll want to run the same code (or at least port it over) to the DaisySeed platform, so performance is reasonably important. But I've not even really worried about trying to squeeze stuff into fixed point math.

                            I'm definitely curious though!

                            @griffinboy no worries at all!

                            Musician - Instrument Designer - Sonic Architect - Creative Product Owner
                            Crafting sound at every level. From strings to signal paths, samples to systems.

                            1 Reply Last reply Reply Quote 0
                            • OrvillainO
                              Orvillain @ustk
                              last edited by

                              @ustk As I understand it, SNEX is a subset of JUCE and c++ and doesn't have the full capabilities. I have used it for some simple waveshaping and I did also write a sample playback node in it, but I hit some snags that at the time I wasn't able to work around. But probably just a knowledge hole that hopefully by now I've filled!!

                              The last thing I wrote in it was a sample+hold effect.

                              Musician - Instrument Designer - Sonic Architect - Creative Product Owner
                              Crafting sound at every level. From strings to signal paths, samples to systems.

                              1 Reply Last reply Reply Quote 1
                              • OrvillainO
                                Orvillain @Christoph Hart
                                last edited by

                                @Christoph-Hart said in Ring Buffer design:

                                TLDR: Start with the implementation that is the easiest to understand / write / debug, then profile or put isolated pieces into Godbolt to see what can be improved.

                                This was my thought too - it isn't any good writing code if I come back to it a week later and can barely understand it. So I try to approach things in the easiest to understand way first, even if it is a bit knuckleheaded. You can always optimize the hell out of a for loop or array access, later on down the line.

                                Musician - Instrument Designer - Sonic Architect - Creative Product Owner
                                Crafting sound at every level. From strings to signal paths, samples to systems.

                                1 Reply Last reply Reply Quote 1
                                • OrvillainO
                                  Orvillain @griffinboy
                                  last edited by

                                  @griffinboy said in Ring Buffer design:

                                  I don't use cubic interpolators on my delay lines, I use FIR filters (using SIMD) using impulses which are equivalent to windowed sync interpolation.

                                  Hey thanks dude for this. I went off on a ChatGPT research mission, and I think I'm starting to understand it. I've got it implemented anyway, it seems to work.

                                  Here's how I'm understanding it, again with a bit of help!!!

                                  ....

                                  Sinc uses a band-limited approach to reconstruct the original continuous signal. But computing a fresh sinc for every fractional position would be super expensive. So we precompute lots of sinc kernels and whack 'em in a table.

                                  The table has a bunch of rows - phases.
                                  Each row contains taps - the weights that nearby samples get multiplied by. The more taps you have gives you a better frequency response, but more CPU usage.

                                  Each row is then a windowed sinc where 'k' is the taps integer offset from the centre sample, 'f' is the fractional position. You take a window function like Hann or Blackman-Harris, and that gentle fades the ends so that the limited sinc doesn't pre-ring.

                                  After building the row, normalize it so it sums to 1. This keeps DC level stable.

                                  Store all rows flat in a coeffs array, with some helpers to read the correct row - rowNearest and rowBlended.

                                  Nearest gives you the closest phase row for the given fraction.
                                  Blended gives the two neighbouring rows plus a blend factor, so you can crossfade across the two.

                                  Then inside my RingDelay, my new 'getReadPos' function gives an integer index and a fractional part. This can be used with any of my interpolation methods (nearest, linear, cubic, lagrange3, and now sinc)

                                  readSinc uses nearest. readSincBlended uses blended.

                                  For each tap, lookup the neighbouring samples around i1, and multiply by the weights. Sum them up, and that gives the interpolated value.

                                  So Cubic or Lagrange - uses 4 samples, and looks smooth. But is not truly band limited. It rolls off highs in the pass band, so things sound slightly dull. Stop band rejection is not great, so it can alias or cause whistle tones.

                                  Windowed Sinc - approximates the ideal reconstruction. Much flatter passband so it keeps brightness. Much stronger stopband attenuation, so there is less aliasing. It is also linear phase, so transients don't get skewed. When you sweep the delay time, it stays clean and consistent. But it costs more CPU.

                                  Musician - Instrument Designer - Sonic Architect - Creative Product Owner
                                  Crafting sound at every level. From strings to signal paths, samples to systems.

                                  griffinboyG 1 Reply Last reply Reply Quote 0
                                  • griffinboyG
                                    griffinboy @Orvillain
                                    last edited by griffinboy

                                    @Orvillain

                                    That's exactly it.
                                    And the cool thing about using a fir interpolator like this with multiple taps, is that you can SIMD it
                                    Process 4 taps in one SIMD instruction cut CPU significantly.

                                    Hise has XSIMD included.
                                    You don't need to include it you can just use it in a c++ node

                                    Oh but it has latency potentially to use this technique. So it's not good for certain uses where it's too fiddly to compensate latency in different places.
                                    But it's good for delay effects and for reading samples from a buffer at a new speed.

                                    OrvillainO 1 Reply Last reply Reply Quote 1
                                    • OrvillainO
                                      Orvillain
                                      last edited by

                                      So yeah, what I've implemented in my ring delay are the following:

                                      Read Nearest: uses 1 sample. It is the worst quality in terms of aliasing and modulation smoothness. But the lowest CPU usage.
                                      Read Linear: Uses 2 samples. It has audible HF loss, the aliasing is poor, but the modulation is a level above nearest. CPU usage is still low.
                                      Read Cubic: Uses 4 samples. Somewhat noticeable HF roll-off, aliasing performance is fair, and the modulation is surprisingly smooth. Medium CPU usage.
                                      Read Lagrange3: Uses 4 samples. Better HF retention than cubic, but still rolls off. Aliasing rejection is okay, but does let some junk through. Smooth modulation. Medium CPU usage.
                                      Read Sinc: Has 16 taps, which means it takes 16 input samples around the read point and does 16 FMA's to produce one output sample. The passband is very flat, no HF loss. Aliasing performance is excellent. Modulation is excellent. CPU is relatively high - but here on my machine never spiked above 1%.
                                      Read Sinc Blended: Also takes 16 input samples, but because we have a blend it performs 32 FMA's to produce one output sample. The passband is very flat, no HF loss. Aliasing performance is excellent. Modulation is excellent. CPU is relatively high - but here on my machine never spiked above 2%.

                                      It will be interesting when I get this stuff onto an embedded platform like the DaisySeed (ARM Cortex based) and then I can see how it all works out CPU-wise.

                                      Musician - Instrument Designer - Sonic Architect - Creative Product Owner
                                      Crafting sound at every level. From strings to signal paths, samples to systems.

                                      1 Reply Last reply Reply Quote 0
                                      • OrvillainO
                                        Orvillain @griffinboy
                                        last edited by Orvillain

                                        @griffinboy said in Ring Buffer design:

                                        @Orvillain

                                        That's exactly it.
                                        And the cool thing about using a fir interpolator like this with multiple taps, is that you can SIMD it
                                        Process 4 taps in one SIMD instruction cut CPU significantly.

                                        Hise has XSIMD included.
                                        You don't need to include it you can just use it in a c++ node

                                        Oh but it has latency potentially to use this technique. So it's not good for certain uses where it's too fiddly to compensate latency in different places.
                                        But it's good for delay effects and for reading samples from a buffer at a new speed.

                                        Indeed! Indeed! I haven't got to any of this yet. Just using basic c++ loops without any SIMD or vectorization.

                                        I'm looking at this purely for chorus or modulation effects, delays, reverbs, and sample playback.

                                        The reason I've implemented all of these is because I want to experiment with 80's, 90's, 2000's, and modern delay and reverb approaches, and I want to embrace the "shittiness" of certain approaches, as long as they add to or enhance the character.

                                        Pretty archetypal example for me. I've got a £100 Boss RV-5 pedal here, and a £700 Meris MercuryX pedal. Both outstanding reverbs. The RV-5 is clearly less clever under the hood. You can hear the delays in the FDN every now and then, and it doesn't sound super high fidelity. But bloody hell is it cool!!! Way better than later Boss reverbs IMHO.

                                        Musician - Instrument Designer - Sonic Architect - Creative Product Owner
                                        Crafting sound at every level. From strings to signal paths, samples to systems.

                                        griffinboyG 1 Reply Last reply Reply Quote 0
                                        • griffinboyG
                                          griffinboy @Orvillain
                                          last edited by

                                          @Orvillain

                                          I've never used sync / fir interpolators in a reverb.
                                          I'm not sure how high the benefit would be.
                                          For delays and synthesis it's great though when you need low aliasing and low ripple.

                                          OrvillainO 1 Reply Last reply Reply Quote 0
                                          • OrvillainO
                                            Orvillain @griffinboy
                                            last edited by Orvillain

                                            @griffinboy said in Ring Buffer design:

                                            @Orvillain

                                            I've never used sync / fir interpolators in a reverb.
                                            I'm not sure how high the benefit would be.
                                            For delays and synthesis it's great though when you need low aliasing and low ripple.

                                            A typical approach to decorrelating reverbs is to mildly modulate delay times in the FDN. But you can also do it on the input diffuser or the early reflections APF network too. So anything that affects delay time is ultimately going to make some kind of difference.

                                            It'd be interesting to compare them all. Maybe that's a new thread in a few weeks or so!

                                            Musician - Instrument Designer - Sonic Architect - Creative Product Owner
                                            Crafting sound at every level. From strings to signal paths, samples to systems.

                                            1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post

                                            19

                                            Online

                                            1.9k

                                            Users

                                            12.4k

                                            Topics

                                            108.4k

                                            Posts