$The 50-Year Compile: How 1970s Math Defeated GOFAI and Built Modern AI$

AI & Agents 17 May 2026 · 12 min

The 50-Year Compile: How 1970s Math Defeated GOFAI and Built Modern AI

By Timothy Mo · 17 May 2026

In 1974, Paul Werbos typed his Harvard PhD thesis on a machine that could not run it. The document, titled Beyond Regression, contained a derivation that would take twelve years to matter and fifty to dominate: a general method for computing partial derivatives across multi-layer systems using the chain rule, applied recursively from output to input. No funding body wanted it. No compute cluster existed to test it meaningfully. It sat in the archive — a compressed executable with no compatible runtime in sight.

The math arrived decades before the machines. That gap is the central tragedy, and the central lesson, of the algorithms that built modern AI.

The GOFAI Wall and the Mathematical Heresy of 1974

Backpropagation

step 01

Forward Pass

step 02

Calculate Error

step 03

Backward Pass

step 04

Adjust Weights

By the mid-1970s, the dominant faith in AI was symbolic. “Good Old-Fashioned AI,” as the philosopher John Haugeland would later name it, held that intelligence was a matter of manipulating symbols according to explicit rules. The expert system was its cathedral. If you could write down every rule a doctor followed, you could build a diagnostic system. Thought, in this framework, was syntax. Meaning was structure. Learning was, at most, a matter of extending the rule book.

The 1969 book Perceptrons by Marvin Minsky and Seymour Papert delivered what felt like the fatal blow to the neural alternative. Minsky and Papert proved, rigorously, that a single-layer perceptron could not solve the XOR problem — the elementary logical operation that returns true only when its inputs differ. Not a subtle limitation. A machine that cannot learn XOR cannot model many of the structured relationships in the world. The proof was elegant, and it landed at exactly the wrong moment: funding agencies read it as a verdict, not a scope limitation. Neural networks were dead, the establishment decided.

The trouble is, what Minsky and Papert had actually proven was narrower than what everyone heard. A multi-layer network solves XOR trivially. The question was how to train such a network — how to push the error signal backwards through the hidden layers, adjusting every weight in proportion to its contribution to the mistake.

Werbos answered this in Beyond Regression. His derivation applied the chain rule — that the derivative of a composed function is the product of the derivatives of each part:

$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial w_i}$

— recursively backwards through each layer of a computational graph. If the network’s output $\hat{y}$ differs from the target $y$ by a loss $L$ , Werbos showed you could propagate $\frac{\partial L}{\partial w}$ through every parameter in the system without approximation. The gradients were exact. The math was clean. The procedure was algorithmic.

The establishment largely ignored it — not because it was wrong, but because it was ideologically foreign and practically unverifiable. The dominant framework was rule-based and discrete. Werbos’s mathematics was continuous, differentiable, and gradient-driven: a different universe of thought. In the theology of 1974 AI, you didn’t need calculus to build intelligence. You needed logic.

What Werbos had written was not merely a paper. It was dormant source code, waiting for a runtime that didn’t exist.

Simulating Darwin in Fortran: Holland’s Biological Compute

Evolutionary Search

$Researcher reviewing continuous-feed paper printouts in a quiet computing archive.$

One year later, a University of Michigan professor named John Holland published Adaptation in Natural and Artificial Systems. Where Werbos had looked inward — to the calculus of errors propagating through a fixed computational graph — Holland looked outward, to biological populations competing under selection pressure.

The Genetic Algorithm Holland formalized is simple enough to state in six lines: create a population of candidate solutions, evaluate each by a fitness function, select the strongest survivors, recombine them through crossover, introduce random mutations, and repeat until convergence. The biological metaphor is almost distractingly clean. But what made Holland’s contribution mathematically serious was his Schema Theorem — a formal result arguing that genetic algorithms are not merely inspired by evolution but are, in a measurable information-theoretic sense, efficient searchers over combinatorial spaces.

The Schema Theorem establishes that short, highly-fit “building blocks” (schemata) in the population receive exponentially increasing trials across generations. In Holland’s framing, a population of $N$ strings implicitly processes on the order of $O(N^3)$ schemata simultaneously — a claim that has been refined and debated in later literature, but whose core insight holds: stochastic, population-based search can efficiently explore high-dimensional, discontinuous fitness landscapes where gradient-based methods cannot function — landscapes with no derivatives to follow.

That last point was everything in 1975. Gradient-based methods like Werbos’s backpropagation were computationally starved and theoretically marginalized. Holland’s approach asked nothing of the fitness landscape. It required no differentiability, no Lipschitz continuity. It only required that you could evaluate a candidate — however jagged the surrounding terrain.

But simulating this on 1975 hardware was its own kind of penance. Holland’s lab worked with early Fortran implementations on IBM mainframes. To run even a modest genetic search — a population of 100 binary strings of length 64, through 500 generations — required punching decks of cards, waiting for batch processing slots, and collecting printouts hours later to assess convergence. The hardware was not merely slow. It was psychologically hostile to iteration. Each experimental run was a diplomatic cable sent into the unknown, answered days later on tractor-feed paper.

And yet Holland persisted, because the math was right. By maintaining population diversity through mutation and crossover, the algorithm avoids the premature convergence that haunts deterministic search. It doesn’t hill-climb; it spreads across the landscape, triangulating optima from multiple directions simultaneously. It is the search strategy of an organism that doesn’t know the shape of the territory — which, Holland recognized, is almost always the condition that matters.

The 1986 Breakthrough and the VAX-11/780 Bottleneck

In October 1986, Nature published “Learning representations by back-propagating errors” by David Rumelhart, Geoffrey Hinton, and Ronald Williams. The paper demonstrated, with working experiments, that multi-layer networks trained by backpropagation could learn internal representations — distributed patterns of activation in hidden layers that encode features the programmer never specified. Networks trained to predict vowel sounds spontaneously organized their hidden layers by acoustic similarity. The math Werbos had written in 1974 was finally speaking, twelve years late.

The weight update at each iteration followed the gradient descent rule:

$w \leftarrow w - \alpha \frac{\partial L}{\partial w}$

where $\alpha$ is the learning rate and $\frac{\partial L}{\partial w}$ is the backpropagated gradient. Each training pass required two traversals: a forward pass to compute output and loss, and a backward pass to compute gradients for every parameter. For a network with $L$ layers and $n$ neurons per layer, each backward pass performs on the order of $O(L \cdot n^2)$ floating-point multiplications.

On a DEC VAX-11/780 — the workstation of choice in the mid-1980s academic AI lab, delivering roughly 1 MFLOP — this arithmetic was punishing. A network with three hidden layers of 100 neurons each, trained on 10,000 examples for 100 epochs, required on the order of $10^{11}$ floating-point operations. At 1 MFLOP, that is approximately $10^5$ seconds — over a day, for a network too shallow to reliably recognize a handwritten digit.

The 1986 paper won the theoretical argument. But it lost the practical war almost immediately. Researchers who tried to scale the approach hit the silicon ceiling within months. Memory constraints forced network sizes down. CPU cycles made training runs that should have taken hours stretch into weeks. The math was sound. The hardware was an insult to it.

The second AI winter descended. Neural networks retreated to the margins again. Critics pointed to the practical failures of deep networks and argued the fundamental approach was misguided. They were wrong about the cause and right about the symptom. The problem wasn’t the gradient. The problem was the clock.

The Hardware Awakening: Vectorization Rescues the Math

$Technical illustration of parallel processing streams in a matrix grid architecture.$

The video game industry did not set out to save neural networks. Nvidia’s engineers designing the GeForce 256 in 1999 were solving a different problem: how to render thousands of three-dimensional triangles per frame, in real time, to satisfy players who wanted their explosions to look convincing. The GPU architecture they built was a massively parallel array of small, identical processing units, each executing the same operation on different data simultaneously — Single Instruction Multiple Data.

Nobody in the gaming division was thinking about Werbos. But the architecture was an almost perfect match for the matrix operations at the heart of backpropagation.

The forward pass of a neural network layer is, essentially, a matrix multiplication: $\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$ , where $W$ is the weight matrix, $\mathbf{x}$ is the input, and $\sigma$ is the activation function applied elementwise. The backward pass computes gradients through the transpose: $\frac{\partial L}{\partial \mathbf{x}} = W^T \frac{\partial L}{\partial \mathbf{a}}$ . Both operations are dense linear algebra — exactly the computation that GPU shader pipelines had been optimized to perform at scale.

When Nvidia released CUDA in 2006, giving researchers direct programmatic access to GPU compute, the collision between Werbos’s 1974 mathematics and 21st-century silicon became inevitable. By 2012, the convergence was undeniable: Krizhevsky, Sutskever, and Hinton’s AlexNet, trained on two consumer GPUs, reduced the ImageNet top-5 error rate from around 26% to 15.3% — a margin that shattered the competition and forced the field to pay attention in a way no theoretical argument had managed. Architectures with ten, twenty, fifty layers — depths that had been theoretically interesting but practically inert for decades — suddenly became trainable.

The asymmetry is worth dwelling on. The rules-based logic of GOFAI — chains of if-then conditionals, symbolic pattern matching, recursive tree search — maps poorly to GPU architecture. Branching is the enemy of vectorization; each conditional forces different threads down different execution paths, destroying the parallelism that makes GPUs fast. Gradient descent over differentiable functions, by contrast, is structurally parallelizable. The same weight-update computation repeats identically across every neuron, every sample in a minibatch, every layer. It was as if the mathematics had been waiting for a specific shape of hardware — and that shape had been built, for entirely unrelated reasons, by people trying to render fog effects in first-person shooters.

Beyond Gradients: Salimans, Evolution Strategies, and the Next Epoch

Optimization Matrix

Backpropagation

Evolution Strategies

Gradient Need

Exact (Differentiable)

None (Stochastic)

Parallelization

Sequential (Layered)

Highly Distributed

Target Use Case

Deep Feature Learning

RL & Meta-learning

The story does not resolve into a clean victory for backpropagation. In 2017, Tim Salimans and colleagues at OpenAI published “Evolution Strategies as a Scalable Alternative to Reinforcement Learning,” demonstrating that gradient-free, population-based optimization — directly descended from Holland’s 1975 framework — could match backpropagation-based reinforcement learning on standard benchmarks while distributing trivially across thousands of CPU cores.

Evolution Strategies work by perturbing a policy’s parameters with Gaussian noise, evaluating each perturbed version’s fitness, and computing a pseudo-gradient:

$\nabla_\theta \mathbb{E}[F(\theta)] \approx \frac{1}{n\sigma} \sum_{i=1}^{n} F(\theta + \sigma \epsilon_i) \cdot \epsilon_i$

where $\epsilon_i$ are random perturbations and $F$ is the fitness score. No backpropagation occurs. No computation graph is maintained. No environment need be differentiable. The approach simply asks: did this perturbation help? If yes, move in that direction. It is, at its core, the Schema Theorem instantiated in continuous parameter space.

Salimans et al. reported that their ES implementation trained a humanoid locomotion policy in ten minutes using 1,440 CPU cores — a task requiring hours with GPU-based backpropagation methods. Because each population member is evaluated independently, parallelization is embarrassingly simple: add machines linearly and receive linear speedup, with no gradient synchronization bottlenecks, no shared memory contention, no dependency chains between workers. Real advantages. But domain-specific ones. ES has not displaced gradient methods for supervised learning or large-scale language model training, where dense, differentiable loss signals make backpropagation decisive.

This is not a displacement of backpropagation. It is a clarification of the map. Gradient descent excels when the loss landscape is smooth, the reward signal is dense, and the architecture is differentiable end-to-end. ES and genetic variants excel when the environment is non-differentiable, the reward is sparse or episodic, or the optimization target involves discrete structural choices. Neural architecture search — the automated discovery of network topologies — is largely evolutionary in practice, because the space of possible architectures is discrete and non-differentiable by nature.

Today’s most capable systems increasingly run both modes simultaneously: Werbos’s chain rule for learning representations inside fixed architectures, Holland’s evolutionary search for discovering those architectures in the first place. The boundary between the two traditions has become productive and porous in ways neither originator could have anticipated while submitting their manuscripts to batch-processing queues.

The Compile Finishes — And Starts Again

$Systems engineer inspecting hardware cables in a high-density data center server aisle.$

What we call the “AI era” is, from one angle, a story about compile time — the agonizing period between a program being written and a machine powerful enough to run it finally arriving. Werbos had the chain rule in 1974. Holland had the Schema Theorem in 1975. Rumelhart, Hinton, and Williams demonstrated practical utility by 1986. None of it waited on a conceptual breakthrough. It waited on transistors.

That claim is most exact for the 1974–1986 cohort of algorithms. The decade after 2010 saw genuine architectural innovations — dropout regularization, batch normalization, attention mechanisms, residual connections — that mattered independent of raw compute. But the pattern of dormant math waiting on hardware has recurred often enough that I’ve stopped treating it as coincidence. It looks structural.

The math compiled slowly, on VAX servers delivering 1 MFLOP, on punch-card Fortran simulations that printed convergence curves on tractor-feed paper. Then it compiled faster, on CUDA cores designed to render game explosions. Then at scale, on data center clusters drawing megawatts of power to run tensor operations Werbos could have specified on a blackboard in 1974.

The implication for what follows is uncomfortable. The physical limits of GPU scaling are no longer hypothetical — power density, memory bandwidth, and lithographic resolution are all approaching walls that transistor count alone cannot breach. Neuromorphic hardware, encoding computation in the timing of electrical spikes rather than the magnitude of floating-point numbers, promises orders-of-magnitude improvements in energy efficiency for certain algorithm classes. Quantum processors offer superposition-based parallelism that could reshape the combinatorial search problems where evolutionary methods already excel.

The question worth sitting with is not what algorithms we will invent next. It is what algorithms already exist — correct and dormant in PhD theses and obscure conference proceedings — that our current hardware simply cannot yet compile. History gives a consistent answer: they are already written. The mathematicians have always been ahead of the machines. Every decade or two, the machines catch up, and we call it a revolution.

It is not a revolution. It is a build completing.

← All field notes Brief a crew →