Deep learning’s success hinges on optimization. SGD, Adam, RMSProp. These algorithms were designed by researchers through intuition, mathematical analysis, and extensive experimentation. But what if we removed the human from the loop entirely?
This project takes a different approach: treat optimizer design as a search problem and let genetic algorithms discover what works.
First, What Even Is an Optimizer?
Think of training a neural network like trying to find the lowest point in a vast, dark mountain range while blindfolded. You can only feel the slope directly under your feet. An optimizer is your strategy for deciding which direction to step next.
The naive approach: always step downhill in the steepest direction. That’s basic gradient descent. But it’s slow and gets stuck easily.
Over decades, researchers developed smarter strategies. Momentum remembers which direction you’ve been heading and keeps some of that motion. Like a ball rolling downhill that doesn’t stop immediately when the ground flattens. Adam tracks both the direction and how bumpy the terrain has been, adjusting step sizes accordingly.
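To make that concrete, here is a rough sketch of those three update rules in NumPy. The hyperparameter defaults are the textbook ones, not anything specific to this project:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain gradient descent: step straight downhill.
    return w - lr * grad

def momentum_step(w, grad, state, lr=0.01, beta=0.9):
    # Keep a running average of recent slopes so steps carry some inertia.
    state["v"] = beta * state.get("v", 0.0) + grad
    return w - lr * state["v"]

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Track both direction (m) and bumpiness (v), then scale the step by both.
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```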
These optimizers power everything from ChatGPT to image generators. Small improvements in optimization translate to faster training, better models, and lower compute costs.
The Core Hypothesis
Hand-designed optimizers encode human assumptions. Adam assumes that adapting learning rates per-parameter via second moments is useful. RMSProp assumes gradient magnitude normalization helps. SGD with momentum assumes that accumulating gradient history smooths the optimization landscape.
In simpler terms: every optimizer we use today exists because a researcher had a hunch, wrote down some math, and tested it. Their intuitions were good. But intuition has limits.
These assumptions work. But are they optimal? Are there combinations we haven’t considered?
The system here encodes optimizers as “genomes”: linear combinations of primitive update terms with learned hyperparameters. Evolution explores the space of possible optimizers through selection, crossover, and mutation.
What’s a Genetic Algorithm?
Genetic algorithms borrow from biology. Instead of a researcher sitting down to design an optimizer, we:
- Create a population of random optimizer candidates
- Test each one on an actual task (training a neural network)
- Select the best performers as “parents”
- Breed offspring by combining traits from two parents
- Mutate slightly to introduce variation
- Repeat for many generations
It’s survival of the fittest, applied to algorithms. Bad optimizers die off. Good ones reproduce. Over generations, the population improves.
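In code, one generation of that loop looks roughly like this. The evaluate, crossover, and mutate functions are stand-ins for whatever implementations get plugged in:

```python
import random

def run_generation(population, evaluate, crossover, mutate,
                   elite_count=2, tournament_size=3):
    # 1. Test every candidate on the actual task and rank them.
    scored = sorted(((evaluate(g), g) for g in population),
                    key=lambda pair: pair[0], reverse=True)

    # 2. Elites survive unchanged.
    next_gen = [g for _, g in scored[:elite_count]]

    # 3. Everyone else is bred: pick parents by tournament, cross, mutate.
    def tournament():
        contenders = random.sample(scored, tournament_size)
        return max(contenders, key=lambda pair: pair[0])[1]

    while len(next_gen) < len(population):
        child = crossover(tournament(), tournament())
        next_gen.append(mutate(child))
    return next_gen
```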
This isn’t new. Genetic algorithms have existed since the 1970s. What’s interesting is applying them to discover optimization algorithms themselves.
How the Genome Works
To evolve optimizers, we need to encode them in a format that can be mutated and crossed over. This project represents each optimizer as a “genome” with two parts:
The recipe (which ingredients to combine):
- GRAD: The raw slope at your current position
- MOMENTUM: A running average of recent slopes (smooths out noise)
- RMS_NORM: The slope, but normalized by how bumpy things have been
- ADAM_TERM: Combines momentum and normalization (what Adam uses)
- SIGN_GRAD: Just the direction of the slope, ignoring magnitude
The settings (how much of each ingredient, and tuning knobs):
- Learning rate: How big each step is
- β1, β2: How much history to remember
- Weight decay: A regularization trick that prevents overfitting
An optimizer genome might say: “Use 70% ADAM_TERM and 30% SIGN_GRAD, with learning rate 0.001 and β1=0.9.”
Adam, the most popular optimizer today, is just ADAM_TERM with specific β1, β2, and bias correction enabled. SGD with momentum is MOMENTUM with appropriate settings. But nothing stops evolution from combining SIGN_GRAD with RMS_NORM at a 0.3:0.7 ratio with unusual decay coefficients. Combinations no human ever tried.
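Here is one way such a genome could look as a data structure. The class and field names are my own sketch rather than the project’s actual code, but they capture the recipe-plus-settings split:

```python
from dataclasses import dataclass, field

@dataclass
class OptimizerGenome:
    # The recipe: primitive name -> mixing coefficient.
    terms: dict = field(default_factory=lambda: {"GRAD": 1.0})
    # The settings: learning rate and weight decay stored on a log10 scale.
    log_lr: float = -3.0   # 10**-3 = 0.001
    beta1: float = 0.9     # how much gradient history to remember
    beta2: float = 0.999   # how much squared-gradient history to remember
    log_wd: float = -4.0   # 10**-4 = 0.0001

# Adam is just the ADAM_TERM primitive with its usual settings...
adam_like = OptimizerGenome(terms={"ADAM_TERM": 1.0})

# ...while evolution is free to mix primitives no hand-designed optimizer uses.
mutant = OptimizerGenome(terms={"SIGN_GRAD": 0.3, "RMS_NORM": 0.7},
                         log_lr=-2.5, beta2=0.95)
```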
The Evolution Loop
Looking at the main GA loop, the process follows a standard evolutionary template:
```
Generation 1/15
Evaluated 5/20...
Evaluated 10/20...
Best fitness: 0.8234 (val_acc: 0.8234)
```
Each generation:
- Evaluates every genome by training a small CNN on FashionMNIST
- Selects parents via tournament selection
- Applies crossover and mutation
- Preserves elites unchanged
The fitness function is validation accuracy after N training steps. Divergence (NaN, Inf, loss explosion) yields fitness=-1, immediately killing unstable configurations.
Translation: if an optimizer causes the neural network to explode into nonsense numbers (a common failure mode), it gets the worst possible score and won’t reproduce.
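As a sketch, the fitness function boils down to something like this, with build_optimizer, train_steps, and validation_accuracy standing in for the project’s real helpers (and the loss-explosion threshold being a guess on my part):

```python
import math

def fitness(genome, model, train_loader, val_loader, num_steps=500):
    # build_optimizer, train_steps, and validation_accuracy are placeholders
    # for the project's real helpers.
    optimizer = build_optimizer(genome, model.parameters())
    final_loss = train_steps(model, optimizer, train_loader, num_steps)

    # Divergence (NaN, Inf, or an exploding loss) gets the worst possible score.
    if not math.isfinite(final_loss) or final_loss > 1e3:
        return -1.0

    # Selection only ever sees validation accuracy; the test set stays untouched.
    return validation_accuracy(model, val_loader)
```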
What Makes This Approach Interesting
Seeding with known optimizers. The initial population includes SGD, Adam, and RMSProp genomes. This isn’t cheating. It’s pragmatic. Evolution can improve on known solutions rather than rediscovering them from scratch.
Think of it like starting a breeding program with champion racehorses rather than random horses off the street.
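Reusing the OptimizerGenome sketch from earlier, seeding might look roughly like this (the exact seed settings are illustrative):

```python
def seed_population(pop_size, random_genome):
    # Champions first: genomes that reproduce SGD-with-momentum, Adam, and RMSProp.
    seeds = [
        OptimizerGenome(terms={"MOMENTUM": 1.0}, log_lr=-2.0),              # SGD + momentum
        OptimizerGenome(terms={"ADAM_TERM": 1.0}, log_lr=-3.0),             # Adam
        OptimizerGenome(terms={"RMS_NORM": 1.0}, log_lr=-3.0, beta2=0.99),  # RMSProp-like
    ]
    # The rest of the population is random, so evolution still gets to explore.
    return seeds + [random_genome() for _ in range(pop_size - len(seeds))]
```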
Multi-task evaluation. The run_ga_multi_task function evaluates genomes across FashionMNIST, CIFAR-10, and MNIST simultaneously. Fitness becomes the mean across tasks. This pressures evolution toward generalizable optimizers rather than task-specific overfitting.
In other words: we don’t want an optimizer that’s amazing at recognizing handwritten digits but terrible at everything else. Testing on multiple tasks keeps evolution honest.
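The aggregation itself is simple. A sketch, with evaluate_on_task standing in for a single-task fitness evaluation:

```python
def multi_task_fitness(genome, evaluate_on_task,
                       tasks=("fashion_mnist", "cifar10", "mnist")):
    # A genome that only shines on one dataset gets dragged down by the others.
    scores = [evaluate_on_task(genome, task) for task in tasks]
    return sum(scores) / len(scores)
```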
Proper experimental methodology. The system maintains strict train/val/test splits. Evolution uses validation accuracy for selection. Test accuracy is computed exactly once, at the end, on the best genome. This prevents the classic mistake of optimizing for test performance.
Why does this matter? Imagine a student who memorizes past exam answers. They’ll ace those specific questions but fail on new ones. Similarly, if we let evolution “see” the test data during selection, it might find optimizers that happen to work on that specific test rather than optimizers that genuinely work well. The held-out test set is the final exam that nobody gets to study for.
```python
# Final evaluation on TEST set (only done once, after evolution)
final_fitness, final_metrics = evaluate_genome_multi_seed(
    best_genome_ever,
    train_loader,
    val_loader,
    device=device,
    num_seeds=3,              # Always use 3 seeds for final evaluation
    test_loader=test_loader,  # NOW we use test set
)
```
The Baseline Comparison
A discovered optimizer is only interesting if it beats existing ones. But comparisons must be fair.
The compare_with_baselines function runs a hyperparameter sweep for SGD, Adam, and RMSProp before comparing against the evolved optimizer. This matters. An evolved optimizer with tuned learning rate shouldn’t be compared against Adam with default settings.
```python
lr_options = [-4.0, -3.0, -2.0]  # 0.0001, 0.001, 0.01
wd_options = [-5.0, -4.0, -3.0]  # 0.00001, 0.0001, 0.001
```
Each baseline gets nine configurations (three learning rates × three weight decay values). The best is selected via validation accuracy, then evaluated on test. This is the fair comparison.
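A sweep like that can be sketched as follows, with evaluate_config standing in for training a baseline at a given setting and reporting validation accuracy:

```python
from itertools import product

def tune_baseline(optimizer_name, evaluate_config,
                  lr_options=(-4.0, -3.0, -2.0), wd_options=(-5.0, -4.0, -3.0)):
    best_cfg, best_val = None, float("-inf")
    # 3 learning rates x 3 weight decays = 9 configurations per baseline.
    for log_lr, log_wd in product(lr_options, wd_options):
        val_acc = evaluate_config(optimizer_name, lr=10 ** log_lr, wd=10 ** log_wd)
        if val_acc > best_val:
            best_cfg, best_val = (log_lr, log_wd), val_acc
    # Only the winning configuration ever gets evaluated on the test set.
    return best_cfg, best_val
```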
It’s like comparing a new chess engine against Stockfish, but making sure Stockfish is actually configured properly. Not running on easy mode.
Questions This Raises
The approach raises several interesting questions:
How far can evolution go beyond its primitives? The primitive set constrains what’s possible. If the optimal optimizer requires a term not in the primitive vocabulary, evolution can’t find it. The choice of primitives encodes assumptions just as strongly as hand-design does.
Think of it this way: if you’re breeding dogs, you can select for size, speed, temperament. But you’ll never breed a dog that can fly. The raw materials constrain outcomes. Same here: if the “perfect” optimizer needs some mathematical operation we didn’t include as a primitive, evolution will never find it.
Does multi-task pressure produce transferable optimizers? Training on FashionMNIST, CIFAR-10, and MNIST simultaneously should discourage task-specific solutions. But these datasets share characteristics. All image classification, similar scales. Would an optimizer evolved here transfer to NLP or reinforcement learning?
What’s the computational cost tradeoff? Each fitness evaluation requires training a neural network. With 20 individuals, 15 generations, that’s 300 training runs minimum. The search is expensive. Whether the discovered optimizers justify that cost depends on downstream reuse.
If an evolved optimizer is 2% better than Adam but costs 300 training runs to discover, you’d need to use it on 15,000+ future training runs to break even. The economics matter.
The Structural Mutations
Beyond simple Gaussian noise on hyperparameters, the mutation operator can add, remove, or change primitive terms. This enables structural exploration. An optimizer can start as modified SGD and evolve toward something Adam-like, or vice versa.
Most mutations are small nudges: learning rate shifts from 0.001 to 0.0012. But occasionally, a mutation adds an entirely new term to the recipe or removes one. This is how evolution explores fundamentally different optimizer structures, not just variations on a theme.
The crossover operator handles this heterogeneity through single-point crossover on term lists and uniform crossover on scalar hyperparameters. Two parent optimizers with different structures can produce viable offspring.
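Sketched against the OptimizerGenome from earlier, the two operators might look roughly like this; the probabilities and ranges are illustrative, not the project’s:

```python
import copy
import random

PRIMITIVES = ["GRAD", "MOMENTUM", "RMS_NORM", "ADAM_TERM", "SIGN_GRAD"]

def mutate(genome, p_structural=0.1, sigma=0.1):
    child = copy.deepcopy(genome)
    # Most of the time: small Gaussian nudges to the scalar hyperparameters.
    child.log_lr += random.gauss(0.0, sigma)
    child.beta1 = min(max(child.beta1 + random.gauss(0.0, 0.01), 0.0), 0.999)
    # Occasionally: add or remove a primitive term, changing the structure itself.
    if random.random() < p_structural:
        if len(child.terms) > 1 and random.random() < 0.5:
            child.terms.pop(random.choice(list(child.terms)))
        else:
            child.terms[random.choice(PRIMITIVES)] = random.uniform(0.1, 1.0)
    return child

def crossover(parent_a, parent_b):
    # Single-point crossover on the term lists...
    terms_a, terms_b = list(parent_a.terms.items()), list(parent_b.terms.items())
    cut_a = random.randint(0, len(terms_a))
    cut_b = random.randint(0, len(terms_b))
    child = copy.deepcopy(parent_a)
    child.terms = dict(terms_a[:cut_a] + terms_b[cut_b:]) or dict(terms_a)
    # ...and uniform crossover on the scalar hyperparameters.
    child.log_lr = random.choice([parent_a.log_lr, parent_b.log_lr])
    child.beta1 = random.choice([parent_a.beta1, parent_b.beta1])
    child.beta2 = random.choice([parent_a.beta2, parent_b.beta2])
    child.log_wd = random.choice([parent_a.log_wd, parent_b.log_wd])
    return child
```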
Conclusion
This project treats optimizer design as an evolutionary search problem. Rather than deriving update rules from first principles or intuition, it lets selection pressure discover what works.
The implementation is methodologically sound. Proper train/val/test splits, fair baseline comparisons, multi-seed evaluation for robustness. The genome representation is expressive enough to encode known optimizers while allowing novel combinations.
Whether evolution can consistently outperform hand-designed optimizers remains an empirical question. But the framework itself represents a different way of thinking about the problem: optimization algorithms as artifacts to be discovered rather than designed.
The researchers who created Adam spent years developing intuition about what might work. Evolution has no intuition. Just brute-force trial and error across generations. Sometimes that’s exactly what you need.