Why AI research needs the philosophy of science
With the paradigms of mechanistic interpretability as a case study
Recently at work I had the opportunity to think about what I would like to see others writing about in AI. I advocated for real-time history and philosophy of science, conducted by philosophers embedded at the frontier of research.
In this post, I summarize a 2019 paper outlining four ways philosophers contribute to science, then highlight Lee Sharkey’s Kuhnian analysis of mechanistic interpretability as a case study of what this kind of work looks like in practice.
Why science needs philosophy
In 2019, a group of philosophers and scientists published “Why Science Needs Philosophy” in the Proceedings of the National Academy of Sciences. The authors argued that philosophy makes specific, concrete contributions to scientific research, and identified four main forms those contributions take.
Conceptual clarification. Scientists often use terms that seem clear but turn out, on examination, to conflate distinct phenomena. The philosopher Lucie Laplane dissected how scientists use “stemness,” an essential property of stem cells, and found the term actually refers to four different kinds of properties depending on tissue type and environment. Her analysis had practical consequences. Different cancer therapies implicitly target different kinds of stemness, so clarifying the concept helped explain why some therapies work for some cancers but not others. The stem cell researcher Hans Clevers called the philosophical analysis readily applicable to experimentation.
Critical assessment of assumptions. Scientific fields inherit frameworks from their predecessors, and sometimes these frameworks contain assumptions that go unexamined. The philosopher Thomas Pradeu critiqued immunology’s dominant “self/nonself” framework, which holds that immune systems react differently to the body’s own cells versus foreign invaders. But this framework struggles to explain why immune systems tolerate symbiotic gut bacteria (clearly foreign), sometimes attack tumors (the body’s own mutated cells), and sometimes attack healthy tissue (as in autoimmune diseases). Pradeu proposed an alternative “discontinuity theory” where immune systems respond to sudden changes in molecular patterns rather than the self/nonself distinction.
Formulation of new theories and concepts. Pradeu’s discontinuity theory is one example of a third kind of contribution: not merely critique but a new framework. The philosopher Jerry Fodor offers another. His 1983 book The Modularity of Mind proposed a new architecture for understanding cognition. It reshaped cognitive science so thoroughly that one psychologist described the field in terms of “before Fodor” and “after Fodor.”
Fostering dialogue. Philosophy can help foster dialogue between different sciences, as well as between science and society. The paper’s authors wrote: “Philosophy and science share the tools of logic, conceptual analysis, and rigorous argumentation. Yet philosophers can operate these tools with degrees of thoroughness, freedom, and theoretical abstraction that practicing researchers often cannot afford in their daily activities.” This makes philosophers useful interlocutors when fields need to communicate across disciplinary boundaries or when scientific developments raise questions that matter to the broader public.
Why the science of artificial intelligence needs philosophy
The authors ended with practical suggestions for reconnecting science and philosophy: make room for philosophy in scientific conferences and vice versa, host philosophers in labs and departments, co-supervise PhD students across disciplines, and include open sections devoted to philosophical and conceptual issues in science journals.1
They also ended with a warning: “Modern science without philosophy will run up against a wall: the deluge of data within each field will make interpretation more and more difficult, neglect of breadth and history will further splinter and separate scientific subdisciplines, and the emphasis on methods and empirical results will drive shallower and shallower training of students.”
They quoted the biologist Carl Woese: “A society that permits biology to become an engineering discipline, that allows science to slip into the role of changing the living world without trying to understand it, is a danger to itself.”
Reading this, I think of artificial intelligence. NeurIPS, the flagship AI and machine learning conference, received over 21,000 paper submissions in 2025, up from roughly 3,300 in 2017. The sheer scale strains the field’s capacity to evaluate its own output. A 2021 experiment found that more than half of papers recommended for spotlight presentations were rejected outright by a parallel committee reviewing the same work.
Machine learning is also perhaps the best contemporary example of a field where engineering capability has dramatically outpaced theoretical understanding. We can build systems that generate coherent text and perform complex reasoning, but our theoretical understanding of why deep learning works, how neural networks represent information internally, and what determines their capabilities remains limited.
This is not necessarily a scandal, as the history of science is full of cases where practical mastery preceded theoretical understanding. We had steam engines before thermodynamics, vaccines before germ theory, and selective breeding before Mendelian genetics. What would it look like for philosophy to engage with the emerging science of AI in the way Laplane engaged with oncology or Pradeu with immunology?
Consider mechanistic interpretability (“mech interp”), a research area within artificial intelligence that aims to reverse-engineer neural networks and understand how they work internally. In a May 2025 paper, researchers Kola Ayonrinde and Louis Jaburi called it “the strange science.” In natural science, we pursue mathematical formalism to describe what we observe. In mechanistic interpretability, we begin with fully specified mathematical objects and must still study them empirically.
The field has recently entered a period of visible crisis, with foundational concepts contested and AI labs reconsidering their approaches. At NeurIPS, one researcher characterized the field as still asking basic questions: “What does it mean to have an interpretable AI system?”
In June 2025, researcher Lee Sharkey offered an analysis of the paradigms the field has gone through, drawing on the work of Thomas Kuhn. Kuhn was a philosopher and historian of science whose 1962 book The Structure of Scientific Revolutions transformed how we think about scientific progress. Kuhn argued that science does not advance through steady accumulation of facts. Instead, it operates within “paradigms,” or shared frameworks of concepts, methods, and standards that define what counts as legitimate work in a field. Over time, observations that resist explanation within the paradigm begin to accumulate. Kuhn called these anomalies. When enough pile up, the field enters a crisis, and eventually a new paradigm emerges. Kuhn called a field “pre-paradigmatic” when it had not yet settled on any shared framework at all.
Sharkey made three of the four contributions the 2019 paper described. He clarified an underspecified concept, identifying “feature” as a term the field treats as fundamental but cannot adequately define. He critically assessed inherited assumptions, arguing that mechanistic interpretability imported its concepts and methods from computational neuroscience without examining whether they suit artificial networks. And he formulated an alternative theory, proposing an approach that takes mechanisms rather than features as the basic unit of analysis.
Mechanistic interpretability’s paradigms
Sharkey’s central claim is that mechanistic interpretability has never been pre-paradigmatic. Instead, the field inherited a well-developed framework from computational neuroscience and connectionism. Within that broader framework, mechanistic interpretability has gone through two distinct “waves,” or mini paradigms, with their own concepts, methods, and crises. Sharkey suggested that a third wave may be emerging.
Sharkey documented that the field’s inheritance from computational neuroscience was extensive. Mechanistic interpretability imported the idea that neural networks represent things in the world, forming internal states (the patterns of activation at layers between input and output) that correspond to edges, textures, objects, or concepts. It imported the idea that these representations can be distributed across many neurons rather than localized in one, and that they form hierarchies, with early layers representing simpler features and later layers representing more abstract ones. It also imported the linear representation hypothesis, which holds that concepts are encoded as directions in the network’s internal space that can be added and subtracted. The classic demonstration is that “king” minus “man” plus “woman” yields “queen,” as if the network represents gender and royalty as separate directions that combine.
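The arithmetic behind the classic demonstration can be illustrated with a toy example. The two-dimensional vectors below are hypothetical, hand-set so that one coordinate behaves like “royalty” and the other like “gender”; real embeddings have hundreds of learned dimensions, but the linear structure works the same way:

```python
import numpy as np

# Hypothetical 2-D embeddings: coordinate 0 ~ "royalty", coordinate 1 ~ "gender".
# Hand-chosen toy values; real embeddings are learned and much higher-dimensional.
emb = {
    "king":  np.array([1.0,  1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
}

def nearest(v, embeddings, exclude=()):
    """Return the word whose embedding is closest to v (Euclidean distance)."""
    candidates = {w: e for w, e in embeddings.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - v))

# Subtracting "man" removes the male direction; adding "woman" adds the
# female one, leaving royalty intact.
result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, emb, exclude={"king", "man", "woman"}))  # → queen
```

If gender and royalty really are separate additive directions, the arithmetic lands exactly on “queen,” which is what the toy values are built to show.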
Mechanistic interpretability also imported methods. Neuroscientists use causal interventions to study the brain, lesioning tissue, cooling regions, or using optogenetics to see what happens when parts of the system are turned on or off. Interpretability researchers do the same thing to neural networks, ablating neurons or artificially boosting their activity.
Even the standards for what counts as a contribution carried over. In both fields, demonstrating that a particular neuron or group of neurons represents something interesting counts as legitimate scientific progress. Neuroscientists celebrated the discovery of the “Jennifer Aniston neuron,” a neuron that fires when a person sees pictures of Jennifer Aniston. Mechanistic interpretability researchers celebrated finding a “Donald Trump neuron” in an image classifier.
This inheritance was, in Sharkey’s view, somewhat unfortunate. Computational neuroscience had been around for decades without making enormous progress in understanding how brains compute. But a flawed paradigm was still a paradigm.
The first wave (2012–2021)
The first wave of mechanistic interpretability emerged alongside the deep learning revolution. In 2012, students of Geoffrey Hinton used deep learning, a type of machine learning built on multi-layered artificial neural networks, to dramatically outperform every other entry in a major image recognition competition. Within a few years, deep learning was everywhere. But the victory demonstrated that neural networks worked, not why they worked. Some researchers doubted these systems had any interpretable internal structure at all. Perhaps they were simply inscrutable black boxes.
First-wave researchers demonstrated otherwise. Artificial neural networks consist of simple computational units called neurons, organized into layers. Researchers design the overall architecture, but training determines what each neuron does. The network sees millions of examples and gradually adjusts its internal parameters until it performs well, learning representations that no one explicitly programmed. Early work studied what the intermediate layers of image classifiers were doing and visualized what different neurons were detecting: edges, textures, patterns, faces, objects. Neural networks were not opaque. They contained meaningful structure that humans could identify and describe.
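The “gradually adjusts its internal parameters” step can be sketched in miniature. This toy example, with hypothetical data, fits a single weight by gradient descent on squared error, the same basic procedure that tunes billions of parameters in a real network without anyone programming what each one ends up representing:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = 3.0 * x                    # the "true" relationship the model must learn

w = 0.0                        # the learned parameter, initialized arbitrarily
lr = 0.1                       # learning rate
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of mean squared error
    w -= lr * grad             # step the parameter against the gradient

print(round(w, 2))  # → 3.0
```

Nothing in the loop says “make w equal 3”; the value emerges from the examples, which is the sense in which learned representations are not explicitly programmed.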
Toward the end of this period, the field coalesced around a particular picture of how neural networks work. Networks detect features: meaningful properties or concepts such as edges, textures, or objects. Each neuron detects a particular feature, so features are the fundamental unit of analysis. Features are connected to one another by weights, the learned numerical parameters that determine how strongly one neuron influences another. A set of features and the weights connecting them form a circuit, a computational subgraph of the network that performs a particular function.
The crisis of the first wave
First-wave interpretability operated on a natural assumption: if we could identify what individual neurons detected, we could understand the network by understanding its parts. The hope was that neural networks had learned something like a clean internal vocabulary.
Sharkey identified polysemanticity as the anomaly that destabilized this first wave. Researchers discovered that individual neurons often responded to multiple unrelated things. Instead of a clean “cat neuron,” they found neurons that fired for cat faces, cat legs, and car grilles.
Researcher Chris Olah, widely considered the pioneer of mechanistic interpretability, emphasized the difficulty that polysemanticity posed to the research program. Interpretation is difficult when neurons do not correspond one-to-one with meaningful concepts. How can we understand what the network is doing if its basic units are semantically scrambled?
The second wave (2022–present)
The second wave explained polysemanticity with the superposition hypothesis: networks want to represent more features than they have neurons. Instead of giving each concept its own neuron, they pack concepts in at angles to each other, overlapping them in the same neural space. A consequence of superposition is that any individual neuron will respond to multiple features, which is exactly polysemanticity.
Chris Olah and colleagues first invoked superposition as a potential explanation for polysemanticity in 2018. In 2022, Nelson Elhage and researchers at Anthropic and Harvard published “Toy Models of Superposition,” which studied the phenomenon directly. They trained simplified networks to represent more features than they had dimensions and observed how the networks arranged them.
Elhage and colleagues found that sparsity made superposition possible. Most concepts are sparse, meaning they appear in only a small fraction of inputs. The concept “golden retriever” might be relevant to perhaps one in a thousand images; the concept “Eiffel Tower” to even fewer. If the network needs to represent a thousand concepts but any given input involves only a handful of them, then overlapping representations rarely cause problems. Two concepts can share the same neural real estate as long as they seldom need it at the same time.
The paper also provided a geometric picture of how superposition works. Consider a neural network layer with 100 neurons. When the network processes an input, each neuron produces a number, so the layer’s activity can be described as a list of 100 numbers. A list of two numbers defines a point on a plane. A list of three numbers defines a point in ordinary three-dimensional space. By the same logic, a list of 100 numbers defines a point in a 100-dimensional space. We cannot visualize 100 dimensions, but the mathematics works the same way. Different inputs produce different points in the space of possible activity patterns, which is called activation space.
In the second wave, a feature is not a single neuron but a direction in activation space. When the network recognizes a cat, the activity pattern moves in a particular direction. When it recognizes a car, the pattern moves in a different direction. A polysemantic neuron is simply a neuron that several concept-directions overlap: each direction has some component along that neuron’s axis, so the neuron fires partially for all of them.
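This geometric picture can be sketched numerically. The toy example below packs three hypothetical feature directions into a two-neuron layer, spacing them evenly; with more features than neurons, every neuron necessarily responds to more than one feature, which is exactly polysemanticity (in high-dimensional layers the overlap between directions is much smaller than here):

```python
import numpy as np

# Three hypothetical feature directions packed into a 2-neuron layer,
# spaced 120 degrees apart. More features than dimensions forces overlap.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# Neuron j's response to a pure feature-i input is features[i, j].
# Listing the features each neuron responds to shows polysemanticity.
for j in range(2):
    responding = [i for i in range(3) if abs(features[i, j]) > 0.1]
    print(f"neuron {j} responds to features {responding}")
    # neuron 0 responds to all three features; neuron 1 to features 1 and 2
```

The directions are not orthogonal (each pair has dot product −0.5), so reading one feature off a single neuron always picks up interference from the others.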
If features were directions rather than neurons, researchers needed a method to identify those directions. The solution was the sparse autoencoder. A sparse autoencoder is a small neural network trained to take activations from a language model layer, expand them into a much larger set of possible directions, and reconstruct the original activations using only a few of those directions at a time. The hope was that each learned direction would correspond to a single clean concept, monosemantic even if the underlying neurons were not.
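A minimal structural sketch of a sparse autoencoder, with toy dimensions and random untrained weights. This is not any lab’s actual implementation: a real sparse autoencoder learns these parameters by minimizing reconstruction error plus a sparsity penalty on the latent code, which is what pushes each input to use only a few of the learned directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 4, 16  # toy sizes; real SAEs use millions of directions

# Untrained random parameters, standing in for what training would learn.
W_enc = rng.standard_normal((d_model, d_dict)) * 0.1
b_enc = np.zeros(d_dict)
W_dec = rng.standard_normal((d_dict, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Expand activation x into a non-negative code, then reconstruct x."""
    code = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps the code non-negative
    recon = code @ W_dec + b_dec
    return code, recon

def sae_loss(x, code, recon, l1_coef=1e-3):
    """Training objective: reconstruction error plus an L1 sparsity penalty."""
    return np.sum((x - recon) ** 2) + l1_coef * np.sum(np.abs(code))

x = rng.standard_normal(d_model)      # a stand-in for one model activation
code, recon = sae_forward(x)
print(code.shape, recon.shape)        # → (16,) (4,)
```

The expansion from 4 to 16 dimensions is the point: each of the 16 learned directions is a candidate monosemantic concept, even though the 4 underlying “neurons” are not.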
This approach generated substantial progress. In May 2024, Anthropic reported extracting features from Claude 3 Sonnet using a sparse autoencoder with 34 million learned directions. One feature activated specifically on references to the Golden Gate Bridge, responding not just to the name itself but to descriptions of the bridge, nearby San Francisco landmarks, and even images of the bridge, despite the sparse autoencoder having been trained only on text. When researchers artificially boosted this feature during text generation, the model began bringing up the Golden Gate Bridge unprompted and, at extreme values, claimed to be the bridge. The feature was not just interpretable but causally active in shaping model behavior.
The crisis of the second wave
Despite successes, anomalies accumulated. Sharkey identified several problems that the second wave could not resolve. The entire research program revolved around features, yet the concept was underspecified. Was a feature an arbitrary direction in activation space? An interpretable property that humans could recognize? A property of the input that a sufficiently large network would reliably dedicate a neuron to representing?
More problems surfaced from the practice of training sparse autoencoders. When researchers trained smaller autoencoders, features were absorbed into coarser-grained concepts. When they trained larger autoencoders, features split into finer-grained versions. A medium-sized autoencoder might find a “dog” feature; a larger one might split this into “golden retriever,” “poodle,” and “German shepherd.” An even larger one might split “golden retriever” further into “golden retriever puppy,” “wet golden retriever,” and “golden retriever in profile.” There was no natural stopping point, no size at which the autoencoder recovered the “true” features. Optimizing for sparsity did not guarantee recovering the features the original network used.
The geometry of features also posed unexplained puzzles. Some features were not single directions but multi-dimensional subspaces, which undermined the linear representation hypothesis the field had inherited. Researchers found that days of the week, months of the year, and years of the twentieth century all formed circular arrangements in activation space. When they visualized how a language model represented “Monday,” “Tuesday,” “Wednesday,” and so on, the days appeared arranged in a circle, with adjacent days next to each other and Sunday looping back to Monday.
Days of the week are cyclical, so a circular representation captures something true about them. But the superposition hypothesis explained why features might be packed at angles to each other, not why they would form circles reflecting the underlying structure of concepts.
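The circular arrangement is easy to reproduce in a toy setting. Here the days are placed by hand at evenly spaced angles in a two-dimensional subspace, a hypothetical stand-in for the learned geometry, and each day’s nearest neighbors come out as the adjacent days, with Sunday looping back to Monday:

```python
import numpy as np

# Hypothetical circular embedding: day k sits at angle 2*pi*k/7 on a circle
# in a 2-D subspace of activation space (hand-built, not learned).
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
angles = 2 * np.pi * np.arange(7) / 7
points = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (7, 2)

# Pairwise distances; each day's two nearest neighbors are the adjacent days.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)
neighbors = dists[days.index("Sun")].argsort()[:2]
print(sorted(days[i] for i in neighbors))  # Saturday and Monday
```

The superposition hypothesis by itself does not predict this structure; the circle has to come from whatever computation the network performs on cyclical quantities.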
Most troubling were the mixed results when researchers tried to use sparse autoencoder features for practical tasks. If features were the fundamental units of neural network computation, then identifying them should enable applications like detecting deception or steering model behavior. Some applications worked—the Golden Gate Bridge feature was causally active, and boosting it changed what the model said. But many features seemed inert or unreliable. In March 2025, Lewis Smith and colleagues at Google DeepMind tested whether sparse autoencoder features helped detect harmful intent in user prompts. The features performed worse than simpler methods. The team announced it was deprioritizing sparse autoencoder research. Sharkey called this “the straw that broke the camel’s back.”
Not everyone felt the crisis equally, and researchers continued making progress within the second-wave framework. But writing in June 2025, Sharkey proposed an alternative: parameter decomposition.
The third wave?
Second-wave methods decomposed activations, the neural activity produced when a network processes an input. Parameter decomposition instead focused on weights, the fixed numerical parameters that define what the network does. Weights do not change from input to input. They are the structure.
Sharkey and colleagues proposed decomposing a network’s weights into components, where each component corresponded to a distinct mechanism. A mechanism is a piece of the algorithm the network has learned. If a network knows that “the sky is blue,” there should be a weight component implementing that knowledge. On inputs that do not involve sky color, removing that component should not change the output.
The second wave treated features as the fundamental unit of neural networks but could not define what a feature was. Parameter decomposition made mechanisms the fundamental unit instead. Unlike features, mechanisms had a more precise definition: weight components that could be removed on irrelevant inputs without affecting output. Features became secondary, simply properties of inputs that activated particular mechanisms.
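The definition can be sketched concretely. In this toy example the “network” is a single weight matrix built by hand as the sum of two hypothetical mechanism components acting on disjoint parts of the input; all values are made up to illustrate the idea, not taken from any real decomposition method. Removing a component on an input that does not engage it leaves the output unchanged:

```python
import numpy as np

# Two hypothetical mechanism components: A acts only on the first two input
# coordinates, B only on the last two. Hand-built toy values.
W_a = np.array([[1.0, 2.0, 0.0, 0.0],
                [0.5, 1.0, 0.0, 0.0]])
W_b = np.array([[0.0, 0.0, 3.0, 1.0],
                [0.0, 0.0, 1.0, 2.0]])
W = W_a + W_b  # the components must sum to the actual parameters

x = np.array([1.0, -1.0, 0.0, 0.0])  # input that does not engage mechanism B

full = W @ x
ablated = (W - W_b) @ x  # remove component B from the weights

# Removing an irrelevant mechanism does not change the output.
print(np.allclose(full, ablated))  # → True
```

On an input that does engage mechanism B (one with nonzero values in the last two coordinates), the same ablation changes the output, which is what makes the component a candidate mechanism rather than an arbitrary slice of the weights.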
In theory, parameter decomposition also addressed the other anomalies. Feature geometry was no longer mysterious. Features were arranged the way they were because mechanisms required that arrangement. The days of the week formed a circle because some mechanism performed cyclic computations on them. Feature splitting also seemed more constrained. Weight components had to sum to the actual parameters, limiting how finely the network could be decomposed.
Sharkey emphasized that parameter decomposition was not yet mature. The methods were not scalable enough to demonstrate that they resolved the anomalies in practice. But the conceptual architecture existed, and if it worked, mechanistic interpretability might be approaching a third wave.
I am excited to see how the story of mechanistic interpretability unfolds. Understanding neural networks is foundational to building AI systems we can trust.
Conclusion
The physicist Richard Feynman is often quoted as saying, “Philosophy of science is about as useful to scientists as ornithology is to birds.” Yet here we are. And this is just mechanistic interpretability! The rest of AI research is equally ripe for philosophical examination.
Sharkey is a scientist appraising his own field philosophically. He identified an underspecified concept, traced inherited assumptions, and proposed an alternative framework. These are exactly the contributions the 2019 paper described.
Philosophers of science have expertise that working scientists typically lack: years spent studying how fields get stuck, how foundational concepts evolve, and what kinds of moves have led to breakthroughs in biology, physics, and cognitive science. Imagine what they could contribute if embedded in AI labs, not only to work on model behavior or ethics, but to study the science of AI itself. And because AI researchers face strong incentives to focus on capability gains rather than foundational questions, philosophers may be uniquely positioned to do this work.
I would be excited to read philosophical work on scaling laws, to name just one area that could benefit. Model performance improves predictably with more data and compute, but why? Scaling laws are driving the biggest capital expenditure boom in history, yet they rest on empirical regularities, straight-line plots on log-log graphs, without theoretical foundations. A recent paper presented at the Philosophy of Science Association argued that the strong scaling hypothesis, which predicts new cognitive abilities will emerge at scale, is not falsifiable in its current form. What justifies the inductive inference? What underpins the different scaling paradigms, from Kaplan to Chinchilla to inference-time compute?
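For concreteness, here is what the empirical regularity looks like in miniature. The losses below are generated from a hypothetical power law, not measured from any real model; the point is that such data falls on a straight line in log-log coordinates, and the line’s slope is the scaling exponent:

```python
import numpy as np

# Hypothetical scaling data: loss follows a power law L(N) = a * N**(-b).
a, b = 10.0, 0.5
N = np.array([1e6, 1e7, 1e8, 1e9])   # model sizes (made up for illustration)
loss = a * N ** (-b)

# In log space the power law is linear: log L = log a - b * log N,
# so fitting a line to (log N, log loss) recovers the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(round(-slope, 3))  # → 0.5, the scaling exponent b
```

The fit says nothing about why the exponent takes the value it does or how far the line extrapolates, which is precisely the kind of question the paragraph above is asking.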
Artificial intelligence is moving fast! The birds, it turns out, could use some ornithologists. Without philosophy, the field risks becoming an engineering discipline that changes the world without understanding what it is building.
Acknowledgements
Thank you to Kola Ayonrinde for substantial discussion and comments. Thank you to my colleagues, especially Rebecca Lowe, Ryan Hauser, and Cullen O’Hara, for valuable feedback.
1. On that last recommendation, such venues now exist! Nature publishes Perspective articles, and NeurIPS accepts position papers. Yet they remain underused by philosophers.