@ustk most likely it's missing knowledge due to sparse documentation :)
Efficiency-wise it shouldn't make much of a difference; in the end the compiler will optimize this pretty well, and all the index types in SNEX are templated, so their interpolation can be derived at compile time.
I haven't played around with fixed-point math at all so I can't say whether that's much faster than using floating points, but there are a few things in the optimization notes from @griffinboy that are redundant:
- `inline` is basically worthless as an optimization hint in C++ by now. Every compiler will inline small functions like this even at the lowest optimization settings.
- there's already a `juce::nextPowerOfTwo()` which contains the same code without the `if` branch (which seems unnecessary anyway).
- using `std::fmaf` for a simple task like `a * b + c` seems to complicate the code and make it slower. Here's the Godbolt output for this C++ code, which compares it against the naive implementation:
```cpp
#include <cmath>

float raw (float a, float b, float c)
{
    return a * b + c;
}

float wrapped (float a, float b, float c)
{
    return std::fmaf (a, b, c);
}

int main()
{
    return 0;
}
```
x64 assembly output for MSVC with -O3:
```asm
float raw(float,float,float) PROC            ; raw
        movss   DWORD PTR [rsp+24], xmm2
        movss   DWORD PTR [rsp+16], xmm1
        movss   DWORD PTR [rsp+8], xmm0
        movss   xmm0, DWORD PTR a$[rsp]
>>      mulss   xmm0, DWORD PTR b$[rsp]
>>      addss   xmm0, DWORD PTR c$[rsp]
        ret     0

float wrapped(float,float,float) PROC
        movss   DWORD PTR [rsp+24], xmm2
        movss   DWORD PTR [rsp+16], xmm1
        movss   DWORD PTR [rsp+8], xmm0
        sub     rsp, 40                      ; 00000028H
        movss   xmm2, DWORD PTR c$[rsp]
        movss   xmm1, DWORD PTR b$[rsp]
        movss   xmm0, DWORD PTR a$[rsp]
>>      call    QWORD PTR __imp_fmaf
        add     rsp, 40                      ; 00000028H
        ret     0
```
You don't need to be able to fully understand or write assembly to extract valuable information out of that. The most obvious thing is that the second function has more lines, which means more time spent there. But it gets worse when you take a closer look; I've marked the relevant lines with `>>`. The first function boils down to two single instructions (`mulss` + `addss`), which are basically free on a modern CPU, while the wrapped function invokes a function call, including copying the values into the call registers etc., which is an order of magnitude slower than the first example.
TLDR: Start with the implementation that is the easiest to understand / write / debug, then profile or put isolated pieces into Godbolt to see what can be improved.
EDIT: I've messed up the compiler flags somehow (`-O3` is a clang compiler flag; with MSVC it's `-Ox`). The fully optimized assembly looks like this:
```asm
float raw(float,float,float) PROC        ; raw
        mulss   xmm0, xmm1
        addss   xmm0, xmm2
        ret     0

float wrapped(float,float,float) PROC    ; wrapped
        jmp     fmaf
```
so `wrapped` now tail-jumps straight into `fmaf` (which in turn can use the CPU's fused multiply-add instruction), which indeed seems quicker, but I would still expect that this doesn't make much of a difference in practice.