
Generating Interpretable Networks using Hypernetworks

Foundational AI

Authors

Isaac Liao, Ziming Liu, Max Tegmark

Abstract

An essential goal in mechanistic interpretability is to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the easier encoding problem, i.e., to convert an interpretable algorithm into network weights. Previous works focus on encoding existing algorithms into networks, which are interpretable by definition. However, focusing on encoding limits the possibility of discovering new algorithms that humans have never stumbled upon, but that are nevertheless interpretable. In this work, we explore the possibility of using hypernetworks to generate interpretable networks whose underlying algorithms are not yet known. The hypernetwork is carefully designed such that it can control network complexity, leading to a diverse family of interpretable algorithms ranked by their complexity. All of them are interpretable in hindsight, although some of them are less intuitive to humans, hence providing new insights regarding how to "think" like a neural network. For the task of computing L1 norms, hypernetworks find three algorithms: (a) the double-sided algorithm, (b) the convexity algorithm, (c) the pudding algorithm, although only the first algorithm was expected by the authors before experiments. We automatically classify these algorithms and analyze how these algorithmic phases develop during training, as well as how they are affected by complexity control. Furthermore, we show that a trained hypernetwork can correctly construct models for input dimensions not seen in training, demonstrating systematic generalization.

Concepts

interpretability, hypernetworks, algorithmic phase discovery, automated discovery, phase transitions, weight space generalization, sparse models, loss function design, inverse problems, symmetry preservation, convolutional networks

The Big Picture

Imagine trying to understand how a calculator works by staring at its bare circuitry. No labels, no manual, just components and wires. That’s roughly what researchers face when trying to decode how a neural network solves a problem from its raw weights alone. Mechanistic interpretability is the field devoted to cracking open these networks and reading the algorithm hidden inside.

There’s a catch: most existing work starts with a known algorithm and builds a network specifically designed to reproduce it. That guarantees interpretability, but it also guarantees you’ll never discover anything genuinely new.

A team from MIT (Isaac Liao, Ziming Liu, and Max Tegmark) took the opposite approach. Instead of asking “how do we build a network that implements this algorithm?”, they asked: “Can we build a system that generates interpretable networks, and then figure out what those algorithms are?” The answer was yes, and the algorithms it found included ones nobody had thought to look for.

Their tool is a hypernetwork: a neural network that designs the internal parameters of another, smaller neural network. By training this hypernetwork carefully, the team produced a family of tiny networks that all solve the same simple problem, computing the L1 norm (the sum of absolute values of a list of numbers), but through surprisingly different and previously unknown methods.
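To make this concrete, here is a minimal PyTorch sketch of the idea. Every name and size below is hypothetical, and the paper's actual architecture differs in its details; the point is only the mechanism: the hypernetwork's own parameters are trained so that the weights it emits make a separate target MLP fit the task.

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Generates the weights of a small two-layer target MLP.

    Hypothetical sketch: each weight slot of the target network gets an
    embedding, and a single shared MLP maps embeddings to weight values.
    Sharing one generator across all slots is one natural way that
    repeating, structured motifs can arise in the generated weights.
    """
    def __init__(self, n_slots: int, emb_dim: int = 16):
        super().__init__()
        self.embeddings = nn.Embedding(n_slots, emb_dim)  # one per weight slot
        self.generator = nn.Sequential(                   # shared weight generator
            nn.Linear(emb_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self) -> torch.Tensor:
        slots = torch.arange(self.embeddings.num_embeddings)
        return self.generator(self.embeddings(slots)).squeeze(-1)  # flat weights

# Target network: d inputs -> h hidden (ReLU) -> 1 output (biases omitted).
d, h = 4, 8
n_weights = d * h + h                # first-layer matrix + second-layer vector
hyper = HyperNetwork(n_weights)

w = hyper()                          # generated weights
W1 = w[: d * h].view(h, d)           # hidden layer
w2 = w[d * h :]                      # output layer

x = torch.randn(5, d)
y_hat = torch.relu(x @ W1.T) @ w2                  # target network's prediction
loss = (y_hat - x.abs().sum(dim=1)).pow(2).mean()  # fit the L1 norm
loss.backward()                      # gradients flow into the hypernetwork
```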

A hypernetwork can systematically explore the space of interpretable algorithms. Some of the computational strategies it finds are ones humans never thought to try, yet they still make sense in hindsight.

How It Works

The researchers chose a deliberately simple test case: the L1 norm. It sounds almost trivial. But simplicity was the point. If you can’t fully understand how a network computes something this basic, what hope is there for complex tasks?
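For reference, the target function is

$$\|x\|_1 = \sum_{i=1}^{d} |x_i|, \qquad \text{e.g. } \|(3,\,-4,\,2)\|_1 = 3 + 4 + 2 = 9.$$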

The target networks are compact two-layer MLPs (multi-layer perceptrons) with a small number of hidden neurons and a single output. Conventional training produces networks that technically solve the problem but look like noise when you inspect their weights. No discernible structure, no interpretable pattern.

The hypernetwork changes this in two ways. First, it generates weights with structured, repeating motifs that act as fingerprints of an underlying algorithm. Second, it includes a parameter β that penalizes unnecessary complexity, pushing networks toward simpler solutions. By sweeping β from high values (simpler networks) to low values (more complex ones) across many random seeds, the team assembled a diverse library of models for analysis.
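In sketch form, sweeping β trades task accuracy against a complexity penalty. The specific penalty below (an L1 penalty on the generated weights) is an illustrative assumption, chosen because it rewards sparse solutions; per the abstract, the paper builds its complexity control into the hypernetwork's design rather than bolting a regularizer onto the loss.

```python
import torch

def training_loss(y_hat: torch.Tensor, y: torch.Tensor,
                  generated_weights: torch.Tensor, beta: float) -> torch.Tensor:
    """Task loss plus a beta-weighted complexity term (illustrative only)."""
    task = (y_hat - y).pow(2).mean()             # how well the target net fits
    complexity = generated_weights.abs().mean()  # assumed proxy for complexity
    return task + beta * complexity

for beta in [1.0, 0.1, 0.01, 0.001]:  # high beta -> simpler networks
    ...  # retrain the hypernetwork at each beta, over many seeds, and catalogue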

To read the algorithms out of these networks, the team used force-directed graph drawings, a visualization technique that arranges neurons on a plane according to their connection strengths. Symmetries become visible at a glance. Three distinct algorithms emerged:

  • The double-sided algorithm: The expected one. Build an absolute value from two ReLU-like neurons, one firing for positive inputs and one for negative, then sum across all dimensions. Clean, familiar, unsurprising. (A minimal sketch appears after this list.)
  • The pudding algorithm: Not expected. This method squeezes the L1 computation through a different algebraic path, and it comes in two signed variants (“positive pudding” and “negative pudding”) that look visually distinct in the force-directed drawings.
  • The convexity algorithm: Also unexpected. This one exploits convexity properties of the absolute value function through yet another computational route.
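The double-sided construction is simple enough to write out exactly: stack +I and −I as the hidden-layer weights of a two-layer ReLU network, so each input dimension gets one positive-firing and one negative-firing neuron, and let a ones-vector output layer sum the pairs.

```python
import torch

d = 3
W1 = torch.cat([torch.eye(d), -torch.eye(d)])  # hidden weights: +I stacked on -I
w2 = torch.ones(2 * d)                         # output weights: sum everything

def l1_norm_double_sided(x: torch.Tensor) -> torch.Tensor:
    # relu(x_i) catches positive values, relu(-x_i) catches negative ones;
    # their sum is |x_i|, and the output layer adds these over dimensions.
    return torch.relu(x @ W1.T) @ w2

x = torch.tensor([3.0, -4.0, 2.0])
print(l1_norm_double_sided(x))  # tensor(9.)
print(x.abs().sum())            # tensor(9.) -- exact agreement
```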

Only the double-sided algorithm was anticipated before the experiments ran. The other two emerged purely from what the hypernetwork chose to do.
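To give a feel for the force-directed readout described above, here is a hedged sketch using networkx's spring layout on the double-sided weights from the previous snippet. This illustrates the general technique, not the paper's exact visualization pipeline.

```python
import networkx as nx
import matplotlib.pyplot as plt
import torch

# Draw the target network as a graph whose edge weights are connection
# strengths. A spring (force-directed) layout pulls strongly coupled
# neurons together, so repeating motifs and symmetries in the weights
# show up as visible geometric structure.
d = 3
W1 = torch.cat([torch.eye(d), -torch.eye(d)])  # double-sided weights
w2 = torch.ones(2 * d)

G = nx.Graph()
for i in range(d):                  # input -> hidden connections
    for j in range(2 * d):
        if abs(W1[j, i]) > 1e-6:    # only draw nonzero connections
            G.add_edge(f"in{i}", f"hid{j}", weight=float(abs(W1[j, i])))
for j in range(2 * d):              # hidden -> output connections
    G.add_edge(f"hid{j}", "out", weight=float(w2[j].abs()))

pos = nx.spring_layout(G, weight="weight", seed=0)  # force-directed positions
nx.draw(G, pos, with_labels=True, node_size=700, font_size=8)
plt.show()
```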

Why It Matters

The real point here isn’t that a hypernetwork found two new algorithms for computing absolute values. It’s what this tells us about the space of interpretable algorithms. Neural networks, even tiny ones, don’t necessarily rediscover human-familiar methods. They find their own paths. Those paths make sense once you see them, but nobody had explored them before.

This has practical consequences for mechanistic interpretability. Right now, researchers hunt for structure in networks that weren’t designed to be interpretable. The MIT team’s approach offers an alternative: use hypernetworks to generate and catalogue interpretable algorithms, building a reference library that could guide future decoding efforts.

The trained hypernetwork also generalizes to input dimensions it never saw during training, constructing correct models for vector sizes outside the training distribution. That kind of generalization is hard to explain by memorization alone. The hypernetwork appears to have learned something genuinely algorithmic about how to solve L1 norm problems.
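One way such dimension generalization can arise, offered purely as an illustrative assumption rather than the paper's stated mechanism: if the weight generator maps a description of each weight slot (say, normalized coordinates) to a value, nothing about it is tied to any particular input size.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a generator that maps normalized (row, column)
# coordinates of a weight slot to its value has no notion of a fixed
# dimension, so it can be queried at sizes never seen in training.
# (Untrained here; only the shapes are the point.)
generator = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

def generate_first_layer(d: int, h: int) -> torch.Tensor:
    rows = torch.arange(h, dtype=torch.float32) / h
    cols = torch.arange(d, dtype=torch.float32) / d
    grid = torch.stack(torch.meshgrid(rows, cols, indexing="ij"), dim=-1)
    return generator(grid.reshape(-1, 2)).reshape(h, d)

print(generate_first_layer(d=4, h=8).shape)     # torch.Size([8, 4]), training size
print(generate_first_layer(d=64, h=128).shape)  # torch.Size([128, 64]), unseen size
```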

A bigger question sits behind all of this. When neural networks solve math problems, do they rediscover human algorithms or invent alien ones? The pudding and convexity algorithms suggest the latter is entirely possible, even for elementary computations. Understanding those alien algorithms, rather than forcing networks into human-shaped molds, may matter a great deal for building trustworthy AI.

The MIT team trained a hypernetwork to generate interpretable networks and discovered two entirely new algorithms for computing L1 norms. Nobody had thought to look for them. Neural networks may have a richer algorithmic repertoire than we’ve assumed.

IAIFI Research Highlights

Interdisciplinary Research Achievement

This work connects physics-inspired tools like complexity control and phase transitions to a core problem in AI safety and interpretability, treating algorithmic discovery as a structured search over a phase space of neural network behaviors.

Impact on Artificial Intelligence

The hypernetwork framework gives mechanistic interpretability researchers a systematic way to generate and catalogue interpretable algorithms, letting them study what computations neural networks actually perform rather than guessing.

Impact on Fundamental Interactions

The discovery of phase transitions between algorithmic regimes, triggered by varying β or by training dynamics, brings a statistical-mechanics perspective to how networks settle into computational strategies, tying physics and machine learning together in a concrete way.

Outlook and References

Future work could scale this approach to more complex tasks and larger networks, testing whether the alien-algorithm phenomenon persists at scale. The paper is available at [arXiv:2312.03051](https://arxiv.org/abs/2312.03051) (Liao, Liu, Tegmark, MIT).
