@Christoph-Hart Thanks for the clarification!

The CPU's SIMD instruction set seems to be a big factor in the actual gains. I found a nice thesis by Anthony Blake called "Computing the fast Fourier transform on SIMD microprocessors", which includes lots of comparisons on fairly recent machines (i7-2600, "only" around 7 years old :=). The comparisons cover a high-performance FFT library called SFFT / Spiral, vDSP, Intel IPP and FFTW3 (page 140 in the PDF):

https://www.cs.waikato.ac.nz/~ihw/PhD_theses/Anthony_Blake.pdf

According to the benchmarks, another library called SFFT (Sparse Fast Fourier Transform) partially trashed the others (FFTW3/vDSP/IPP) around the 4-64 bit sets in lots of tests. But according to the Spiral SFFT page:

"the algorithm can be faster than modern FFT libraries. However, the reference implementation is not optimized for modern hardware features such as the cache hierarchy, vector instruction sets, or multithreading."

Too bad, I would have loved to stare at some more graphs and then go "I wish I understood the context". :) It is now being used in 3D FFT to analyze the trajectories of gravitational pulls or ... well, space n shit 😁

@d-healey Thanks for the video, I saw it yesterday and it is really good. He explains it perfectly.