@d-healey It's called Deep Learning for a reason 😉
WaveNet: Convolutional network built from stacked dilated causal convolutions that learns to estimate a function (e.g. an amp or pedal) directly from training data. It's feedforward (no recurrence) and one of the simpler architectures to get running; the original DeepMind paper is from 2016
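For flavour, here's a minimal (untrained, made-up) Keras sketch of the idea -- a stack of dilated causal 1D convolutions mapping an input buffer to an output buffer, the way WaveNet-style models get used for amp/pedal modelling. Layer sizes are arbitrary:

```python
import tensorflow as tf
from tensorflow.keras import layers

# WaveNet-style stack: each Conv1D doubles its dilation so the receptive
# field grows quickly without the network getting huge
def tiny_wavenet(channels=16, dilations=(1, 2, 4, 8, 16)):
    inp = layers.Input(shape=(None, 1))        # (time, 1) raw audio samples
    x = inp
    for d in dilations:
        x = layers.Conv1D(channels, kernel_size=3, dilation_rate=d,
                          padding="causal", activation="tanh")(x)
    out = layers.Conv1D(1, kernel_size=1)(x)   # back down to one audio channel
    return tf.keras.Model(inp, out)

model = tiny_wavenet()
model.compile(optimizer="adam", loss="mse")    # e.g. train on (dry, wet) pairs
```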
AutoEncoder: Takes a complex representation of data (often a mel spectrogram), compresses (encodes) it down to a small representation, then learns to reconstruct the original complex representation from that encoded information. Useful for data compression, and more useful when you make it variational
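A minimal sketch of that encode/decode idea in Keras (shapes and sizes are assumptions, not from any particular paper):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_mels, latent = 128, 8                            # assumed sizes
inp = layers.Input(shape=(n_mels,))                # one mel-spectrogram frame
h = layers.Dense(64, activation="relu")(inp)
z = layers.Dense(latent, activation="relu")(h)     # the compressed encoding
h = layers.Dense(64, activation="relu")(z)
out = layers.Dense(n_mels)(h)                      # reconstruction of the frame
autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(mel_frames, mel_frames, ...)     # input == target
```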
Variational AutoEncoder (VAE): Uses a fancy reparametrization trick to "map" each complex representation to a point in a latent space; the network can then decode from any arbitrary location inside that space. Think of it like burying treasure: you bury one piece in the sandbox, one under the tree, one next to the car -- then you say "hey stupid neural network idiot, decode the stuff under the car" and it will (hopefully) give you exactly what was "mapped" to that location
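The "fancy reparametrization trick" is just sampling z = mu + sigma * eps so the sampling step stays differentiable. A minimal Keras sketch (sizes are made up, and the KL-divergence part of the loss is left out):

```python
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    def call(self, inputs):
        mu, log_var = inputs
        eps = tf.random.normal(tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * eps

n_mels, latent = 128, 8
inp = layers.Input(shape=(n_mels,))
h = layers.Dense(64, activation="relu")(inp)
mu = layers.Dense(latent)(h)                  # where in the latent space
log_var = layers.Dense(latent)(h)             # how "spread out" around it
z = Sampling()([mu, log_var])
h = layers.Dense(64, activation="relu")(z)    # decoder: works from ANY z,
out = layers.Dense(n_mels)(h)                 # not just the training points
vae = tf.keras.Model(inp, out)
# a real VAE also adds a KL term to the reconstruction loss
```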
Generative Adversarial Network (GAN): Uses one network (the generator) to "trick" another network (the discriminator/critic). Each network gets better at its job (tricking / not being tricked) until the generator's output is indistinguishable from real data. Deepfakes are a type of GAN iirc
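One training step of that tug-of-war looks roughly like this (hand-rolled sketch; `generator`, `discriminator` and `real_batch` are assumed to already exist):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(generator, discriminator, real_batch, latent_dim=64):
    noise = tf.random.normal([tf.shape(real_batch)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(noise, training=True)
        real_score = discriminator(real_batch, training=True)
        fake_score = discriminator(fake, training=True)
        # discriminator: call real stuff real, call fakes fake
        d_loss = bce(tf.ones_like(real_score), real_score) + \
                 bce(tf.zeros_like(fake_score), fake_score)
        # generator: make the discriminator call fakes real
        g_loss = bce(tf.ones_like(fake_score), fake_score)
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
```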
Diffusion Model: Trains by taking a data sample and adding noise to it incrementally; the network learns to predict (and therefore undo) the noise added at each step. At generation time you pass it pure random noise and it iteratively denoises it into a sample. DALL-E 2 and Stable Diffusion are diffusion models; they aren't necessarily ideal for audio since they're quite slow at inference time
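The "adding noise incrementally" part is simple enough to show directly; a minimal sketch of the forward (noising) process with an assumed linear schedule:

```python
import tensorflow as tf

T = 1000
betas = tf.cast(tf.linspace(1e-4, 0.02, T), tf.float32)   # noise schedule
alphas_cumprod = tf.math.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Jump straight to noise level t (closed form of t little noise steps)."""
    noise = tf.random.normal(tf.shape(x0))
    a = tf.gather(alphas_cumprod, t)            # t is an integer timestep
    return tf.sqrt(a) * x0 + tf.sqrt(1.0 - a) * noise, noise

# training: the model sees (noisy_x, t) and is trained to predict `noise`.
# inference: start from pure noise and remove the predicted noise step by
# step -- hundreds of model calls, which is why it's slow.
```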
DDSP: Differentiable digital signal processing. They basically train a neural network to control audio effects (or a synthesizer); by letting the network take the wheel it doesn't have to generate the raw audio data itself and instead just controls existing DSP tools
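To make that concrete, here's a toy sketch of the principle (not the actual DDSP library, which uses harmonic oscillators and filtered noise): the network outputs control values, a trivially differentiable DSP op applies them, and gradients flow through the DSP back into the network:

```python
import tensorflow as tf
from tensorflow.keras import layers

# toy controller: per-frame features in, a single gain value per frame out
feat = layers.Input(shape=(None, 64))                 # (frames, features), assumed
gain = layers.Dense(1, activation="sigmoid")(feat)    # control signal in [0, 1]
controller = tf.keras.Model(feat, gain)

def differentiable_gain(audio_frames, gain):
    # the "DSP" here is just a multiply -- differentiable, so a loss on the
    # processed audio trains the controller, and the network never has to
    # generate raw samples itself
    return audio_frames * gain
```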
RAVE: Realtime Audio Variational autoEncoder, a fancy VAE that also adds a GAN stage during training. It's the current state-of-the-art for realtime neural synthesis and is embeddable on micro devices and such. I have no idea where to begin implementing it in HISE as it's quite complex
Implementation: You basically need a fast inference library and some sort of time-series network. The former can be RTNeural (installing TensorFlow / PyTorch is kinda annoying compared to the simple RTNeural library); the latter can be an LSTM, GRU, plain RNN or a convnet. It also has to be trained (obviously)
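A minimal sketch of such a time-series model in Keras (shapes and sizes assumed; RTNeural can load weights exported from models like this, but check its docs for the exact export format):

```python
import tensorflow as tf
from tensorflow.keras import layers

# LSTM that maps an input audio stream to an output audio stream,
# one sample per timestep -- e.g. dry guitar in, amp'd guitar out
model = tf.keras.Sequential([
    layers.Input(shape=(None, 1)),            # (time, 1) input samples
    layers.LSTM(16, return_sequences=True),   # the "memory" / temporal part
    layers.Dense(1),                          # predicted output sample
])
model.compile(optimizer="adam", loss="mse")
# model.fit(dry_audio, wet_audio, ...)        # paired training data
```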
All of these time-series models basically "predict" the next audio sample or buffer based on the previous one(s). Without temporal coherence the network just predicts each sample independently, which results in white noise (i.e. my problem in my VAE)
There are also folks like the Dadabots guys who generate full music instead of individual instruments/effects; they have a 24-hour metal livestream that is generated in realtime, which is really cool
You can find all sorts of tutorials on YouTube on how to build simple neural nets using Python and a library like Keras. Be warned: you'll be staring at hand-drawn MNIST digits for hours at a time
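If you want the classic "hello world" you'll see in those tutorials, it's roughly this (straight out of the standard Keras beginner examples):

```python
import tensorflow as tf

# the hand-drawn digits the tutorials never shut up about
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),                 # one logit per digit class
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```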
okay that's my last year and a half condensed into an ugly forum post 🙂