## Discussion

This is the sigmoid function, a common activation function in early neural networks. It transforms the input $x_j$ into an output $y_j$ between 0 and 1:

$$y_j = \frac{1}{1 + e^{-x_j}}$$

Hinton uses it because it introduces non-linearity, which is essential for the network to learn complex patterns. The sigmoid is smooth, bounded, and differentiable, making it well suited for training with gradient descent. Its derivative is easy to compute, which simplifies backpropagation - a key feature in early neural net research.

![](https://i.imgur.com/9kQWXCi.png)
*Sigmoid function*

Hinton is pointing out that in some of the earlier work on neural network models, generalization (the ability of a model to respond correctly to new, unseen inputs) seemed impressive but was, in fact, often somewhat hand-engineered: it relied on researchers manually constructing the input patterns for different concepts in ways that made generalization easier or even inevitable.

Horace Barlow was a neuroscientist known for foundational work on sensory processing in the brain. He proposed that the brain might encode concepts using highly selective neurons to reduce redundancy - a view known as localist representation. This idea gave rise to the famous “grandmother cell” metaphor: a hypothetical neuron that fires only when you see your grandmother. Hinton contrasts this view with distributed representation, where concepts are encoded as patterns across many neurons, making the system more flexible, generalizable, and robust.

Monomorphemic words are single-morpheme items with no prefixes or suffixes (e.g., dog, run, big). Synonyms of this kind (e.g., big vs. large) usually look and sound unrelated, so a word’s surface form offers almost no semantic clue. Hence a model must learn internal distributed codes that place related meanings near one another - Hinton’s core argument for distributed representations.

This expression defines the mean squared error (MSE), a common loss function in neural networks. It measures how far the network’s output $y_{jc}$ is from the desired output $d_{jc}$ for each output unit $j$ and training case $c$. The total error $E$ is the sum of these squared differences, scaled by $\frac{1}{2}$ for mathematical convenience during differentiation. While Hinton calls it “error,” this is what we now commonly refer to as the cost function or loss, which backpropagation aims to minimize during training.
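As a quick illustration, here is a minimal NumPy sketch of the sigmoid and this error measure - not code from the paper, and the input and target values are made up purely to show the definitions:

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) activation: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical example values, purely to illustrate the definitions above.
x = np.array([0.5, -1.2, 2.0])   # total inputs x_j to three output units
y = sigmoid(x)                   # outputs y_j, each between 0 and 1
d = np.array([1.0, 0.0, 1.0])    # desired outputs d_j for one training case

# Error measure: E = 1/2 * sum over units (and training cases) of (y - d)^2
E = 0.5 * np.sum((y - d) ** 2)
print(y, E)
```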
In other words, Hinton is asking: after training, what do the nodes in the middle layers represent?

Here is another representation of the five-layer network depicted in figure 5:

![](https://i.imgur.com/zQD0m4f.jpeg)

Here Hinton gives an overview of back-propagation, a learning algorithm first popularized by Rumelhart, Hinton, and Williams in their influential 1986 Nature paper, "Learning representations by back-propagating errors." Although the core idea had previously been explored by Paul Werbos and others, the algorithm’s significance lies in providing an efficient method for training multilayer networks by systematically propagating errors backward to update the weights. This breakthrough allowed neural networks to learn complex internal representations effectively, marking a turning point in neural network research.

Hinton is claiming that if your learning algorithm only paid attention to which input units light up at the same time as which output units during training, it would not do well on the family-tree task. Why? Because in the test questions, the person you feed in as person 1 has never been paired with the right person 2 in the training data. In fact, it has been paired with other outputs, so the simple input-output correlation is actually negative. The rule you need to discover (e.g. “person 2 is person 1’s aunt”) is a relational rule; you won’t find it by looking at one-to-one correlations between individual units. You need to learn the hidden structure that ties roles and relationships together.

What Hinton is saying is that, back then, we lacked a clear-cut mathematical recipe that, no matter what data you gave it, always told you the best way to generalize to new cases. Because that high-level rule didn’t exist, Hinton had to rely on demonstrations - showing that his network behaved sensibly on one problem rather than proving it would always do so.

### Where we are now

Theory has filled in parts of the picture: we now have frameworks that can guarantee good generalization if you first spell out some assumptions (for example, limit the kinds of models considered or state how “simple” explanations are preferred). What we still don’t have is a single assumption-free formula that works for every dataset and every model. Hinton’s warning therefore remains valid today.

Quick refresher on the chain rule: the chain rule is a fundamental rule in calculus for computing the derivative of a composition of functions. When a variable depends on another variable, which in turn depends on a third, the chain rule tells us how to compute the derivative across those dependencies. If $E$ depends on $y_j$, and $y_j$ depends on $x_j$, then:

$$ \frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_j} $$

This is how gradients are passed backward through the layers of a neural network: by chaining partial derivatives together.

To compute $\frac{\partial E}{\partial x_j}$, we need the derivative of the sigmoid function from Equation 2:

$$ y_j = \frac{1}{1 + e^{-x_j}} $$

The key property of the sigmoid is that its derivative can be written in terms of its output:

$$ \frac{dy_j}{dx_j} = y_j (1 - y_j) $$

This follows from applying the quotient and chain rules during differentiation. Plugging this into the chain rule gives:

$$ \frac{\partial E}{\partial x_j} = \frac{\partial E}{\partial y_j} \cdot y_j (1 - y_j) $$

This is the gradient of the error with respect to the input of the sigmoid unit.
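A rough numerical sketch (again NumPy, with made-up values, not code from the paper) that assembles this gradient from the squared-error measure, the chain rule, and the sigmoid derivative, and checks it against a finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(x, d):
    """E = 1/2 * sum_j (y_j - d_j)^2 for one training case."""
    return 0.5 * np.sum((sigmoid(x) - d) ** 2)

# Hypothetical values for three sigmoid output units on one training case.
x = np.array([0.5, -1.2, 2.0])   # total inputs x_j
d = np.array([1.0, 0.0, 1.0])    # desired outputs d_j

y = sigmoid(x)                   # y_j = 1 / (1 + e^{-x_j})
dE_dy = y - d                    # dE/dy_j from the squared-error measure
dy_dx = y * (1.0 - y)            # sigmoid derivative y_j (1 - y_j)
dE_dx = dE_dy * dy_dx            # chain rule: dE/dx_j = dE/dy_j * dy_j/dx_j

# Finite-difference check on the first unit: should closely match dE_dx[0].
eps = 1e-6
bump = np.array([eps, 0.0, 0.0])
numeric = (error(x + bump, d) - error(x - bump, d)) / (2 * eps)
print(dE_dx[0], numeric)
```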
Hinton notes that connections within a layer, or from higher to lower layers, are forbidden, reflecting the architecture of early feedforward neural networks. These models only allowed connections to flow in one direction: from input to hidden to output layers. This constraint simplified training and analysis but limited the network’s ability to represent temporal or contextual dependencies. Later, researchers began exploring recurrent neural networks (RNNs), which allow cycles and connections back to earlier layers or within a layer. These emerged notably in the late 1980s and early 1990s: Jordan networks (1986) and Elman networks (1990) were among the first to introduce feedback connections for modeling sequential data. These architectures broke the strict layer-to-layer constraint and enabled networks to maintain memory and model time-varying patterns, laying the foundation for modern RNNs and LSTMs.

![](https://i.imgur.com/ejT5d2z.png)

Geoffrey Hinton is one of the founding figures of modern artificial intelligence. Often referred to as the “Godfather of Deep Learning,” Hinton has been instrumental in shaping the trajectory of neural networks. He had an unconventional academic path: he studied experimental psychology at Cambridge, spent a year as a carpenter’s apprentice, explored interests in physiology and philosophy, and eventually earned a Ph.D. in artificial intelligence from the University of Edinburgh, where he began fusing ideas from neuroscience and computation.

This paper was published during Hinton’s tenure at Carnegie Mellon University. At the time, symbolic AI dominated the field, emphasizing rule-based systems and logical inference. Hinton and a small group of researchers - including David Rumelhart and Ronald J. Williams - pushed back against this orthodoxy, arguing that intelligence could emerge from networks of simple, neuron-like units trained through gradient descent. This work introduced key ideas around distributed representations, where concepts are not stored in a single unit or symbol but are encoded across patterns of activity in many units; that idea is the emphasis of this paper.

The significance of the paper lies in its foresight. It anticipated much of what would later become foundational in deep learning: the use of distributed, high-dimensional embeddings; the idea that representations are learned, not hand-coded; and the notion that similarity in meaning corresponds to similarity in representation.

![](https://i.imgur.com/WsCOrDD.png)
*Hinton*