What can neuroscience learn from AI?
Mechanistic interpretability for neural interpretability.

Jai Bhagat -- December 2024   (Draft)


The history of AI is intertwined with neuroscience. Artificial neural networks (ANNs), which serve as the foundation of ubiquitous contemporary generative AI systems like ChatGPT and Midjourney, were inspired by biological neuronal networks, hence their name. Over time, however, neuroscience's influence on AI has diminished, as progress in AI has come mainly from:

  1. Theoretical concepts that have little to do with how biological neural computation is performed (e.g. the transformer architecture), and
  2. Scaling of compute and data (i.e. making ANNs bigger and training them for longer on more data), which has given rise to increasingly intelligent behavior.

Many in the field of neuroAI still believe in the potential of neuroscience to influence AI; neuromorphic hardware is one area where this could happen, especially when considering that the human brain uses less than 5% of the power consumed by a contemporary high-end consumer-grade GPU. This post, however, will not focus on how neuroscience may still be able to influence AI, but rather the reverse: how AI can influence neuroscience.

A brief history of AI tools for systems neuroscience

Note: in this section, "AI" will refer more specifically to supervised or semi-supervised training of deep artificial neural networks; i.e. deep learning.

Early applications of AI in systems neuroscience focused primarily on regression and classification problems: for instance, decoding neural activity patterns to predict features of the environment or the organism's behavior, or classifying behavioral or environmental states from sensor data such as video or audio. Although these approaches improved our predictive power, they largely treated biological neural circuits as black boxes. As deep learning developed throughout the 2010s, more sophisticated tools emerged.
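To make the decoding framing concrete, here is a minimal sketch that fits a classifier to predict a binary stimulus label from trial-by-trial spike counts. The synthetic data, shapes, and choice of scikit-learn are illustrative assumptions, not a description of any particular study.

```python
# Toy decoding example: predict a stimulus label from binned spike counts.
# All data here are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_trials, n_neurons = 500, 100
spike_counts = rng.poisson(lam=2.0, size=(n_trials, n_neurons)).astype(float)
stimulus = rng.integers(0, 2, size=n_trials)      # binary stimulus label per trial
spike_counts[stimulus == 1, :10] += 1.0           # weak stimulus-dependent rate change

X_train, X_test, y_train, y_test = train_test_split(
    spike_counts, stimulus, test_size=0.2, random_state=0
)
decoder = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"decoding accuracy: {decoder.score(X_test, y_test):.2f}")
```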

In particular, convolutional ANNs revolutionized behavior tracking and pose estimation, while recurrent ANNs provided new ways to model temporal dynamics in neural circuits. One pivotal development came with the introduction of latent variable models, which allowed us to reliably discover low-dimensional structure in high-dimensional neural activity. These models, while powerful, often learn representations that are difficult to interpret biologically: “there remains a gap in our understanding of how these latent spaces map to neural-level computations” [1].
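To illustrate the latent variable idea, here is a hedged sketch that fits a plain factor analysis model to synthetic population spike counts and recovers a few low-dimensional latents. Real latent variable models for spiking data are considerably more sophisticated; everything below is a toy stand-in.

```python
# Toy latent variable example: recover low-dimensional structure from
# high-dimensional (synthetic) population spike counts with factor analysis.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

n_timebins, n_neurons, n_latents = 2000, 120, 3
latents = np.cumsum(rng.normal(size=(n_timebins, n_latents)), axis=0)  # slow latent trajectories
loading = rng.normal(size=(n_latents, n_neurons))                      # latent-to-neuron weights
rates = np.exp(0.05 * latents @ loading)                               # positive firing rates
spikes = rng.poisson(rates)                                            # (time bins x neurons) counts

fa = FactorAnalysis(n_components=n_latents).fit(spikes)
recovered = fa.transform(spikes)   # inferred low-dimensional trajectories
print(recovered.shape)             # (2000, 3)
```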

AI researchers have faced similar challenges in understanding the inner workings of their ANN models. Understanding the mechanisms by which a model makes decisions is crucial for debugging and improving it; i.e. for developing AI safely and evolving AI capabilities. Early AI interpretability work focused primarily on visualization techniques for understanding what "visual features" a convolutional ANN had learned after being trained to classify objects in images. However, a key shift occurred in the early 2020s with the emergence of the field of mechanistic interpretability, which sought to move beyond simply visualizing learned features or attributing importance to inputs, and instead aimed to reverse engineer the underlying computational mechanisms in ANNs.

Mechanistic interpretability: foundations and promise

Mechanistic interpretability emerged as AI researchers sought to understand how large language models (LLMs) implement specific capabilities. Key techniques include activation pattern analysis, circuit discovery, and sparse autoencoder (SAE) feature extraction. The field's core insight is that ANNs, despite their complexity, implement computations through interpretable features and circuits that we can systematically identify and understand. This mirrors a fundamental assumption in systems neuroscience: that neural circuits implement specific computations through identifiable mechanisms. Therefore, I believe that the tools developed to understand artificial neural networks can provide new ways to understand biological ones.
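Before the case study, here is a minimal PyTorch sketch of the common SAE recipe used in mechanistic interpretability: an overcomplete linear encoder/decoder trained to reconstruct activation vectors under an L1 sparsity penalty. The dimensions, hyperparameters, and random training data are illustrative assumptions, not the configuration of any specific study.

```python
# Minimal sparse autoencoder (SAE) sketch: overcomplete dictionary + L1 penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_input: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_input, d_features)
        self.decoder = nn.Linear(d_features, d_input)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        recon = self.decoder(features)           # reconstruction of the input activity
        return recon, features

d_input, d_features, l1_coeff = 128, 1024, 1e-3
sae = SparseAutoencoder(d_input, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for activation vectors (LLM activations, or binned neural activity).
data = torch.randn(4096, d_input)

for step in range(200):
    batch = data[torch.randint(0, len(data), (256,))]
    recon, features = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, each decoder column can be read as a dictionary element, and it is the pattern of inputs that activate a given feature that gets inspected for interpretability.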

Training sparse autoencoders on neural data: a case study

My recent work applies SAEs to neural activity recorded in mouse visual cortex while the mice are presented with a variety of visual stimuli. My analysis of SAEs trained on this neural activity ended up closely mirroring the analysis done on SAEs trained on the activations of transformer language models, where learned SAE features have been found to correspond to particular linguistic features. My approach involved training an SAE to reconstruct the recorded spiking activity, extracting its learned sparse features, tracing those features back to the source spiking activity, and relating them to known properties of the visual stimuli, as sketched below.
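As a hedged sketch of the preprocessing this implies, assuming the recordings arrive as per-neuron spike-time lists, the spikes can be binned into a (time bins x neurons) count matrix that an SAE like the one above can be trained on. The function name, bin size, and synthetic data are all illustrative.

```python
# Bin per-neuron spike times into a (time bins x neurons) count matrix for SAE training.
import numpy as np

def bin_spikes(spike_times_per_neuron, bin_size_s=0.01, t_stop_s=100.0):
    """Convert per-neuron spike-time lists into a (n_bins, n_neurons) count matrix."""
    n_bins = int(round(t_stop_s / bin_size_s))
    counts = np.stack(
        [np.histogram(np.asarray(ts), bins=n_bins, range=(0.0, t_stop_s))[0]
         for ts in spike_times_per_neuron],
        axis=1,
    )
    return counts.astype(np.float32)

# Synthetic spike times for 5 neurons over 100 s.
rng = np.random.default_rng(2)
spike_times = [np.sort(rng.uniform(0.0, 100.0, size=rng.integers(200, 500))) for _ in range(5)]
binned = bin_spikes(spike_times)
print(binned.shape)   # (10000, 5)
```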

Interestingly, I found that particular SAE features correspond to particular properties of the visual stimuli, like full-field flashes! This shows that SAEs can learn to reconstruct neural spike activity and compress it into sparse, interpretable features that can be traced back to the source spiking activity. Crucially, we can validate these features by examining their relationship to known properties of visual processing. This suggests the SAE is discovering genuine computational features rather than arbitrary representations, and it lends promise to the idea that we can use SAEs to find novel neural representations!
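One simple way to run this kind of validation, assuming we already have a per-time-bin activation trace for a single SAE feature and a list of full-field-flash onset times (both hypothetical placeholders below), is to average the feature's activation around each flash onset and check whether it is stimulus-locked.

```python
# Peri-stimulus averaging of a (placeholder) SAE feature around flash onsets.
import numpy as np

bin_size_s = 0.01
feature_trace = np.random.default_rng(3).random(10_000)   # stand-in SAE feature activation per bin
flash_onsets_s = np.arange(5.0, 95.0, 5.0)                 # hypothetical flash onset times (s)

window_bins = int(0.5 / bin_size_s)                        # +/- 0.5 s around each flash
onset_bins = np.round(flash_onsets_s / bin_size_s).astype(int)
snippets = np.stack(
    [feature_trace[b - window_bins : b + window_bins] for b in onset_bins]
)
psth = snippets.mean(axis=0)   # peri-stimulus average of the feature activation
print(psth.shape)              # (100,)
```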

Mechanistic interpretability for neural interpretability: the road ahead

As we stand on the cusp of a new era in neuroAI, the potential for mechanistic interpretability to advance our understanding of biological nervous systems is immense. By continuing to refine and expand our interpretability toolbox, we can unlock new insights into both biological and artificial intelligence. Ultimately, the goal is not just to better understand intelligent systems, but to leverage this knowledge to advance humanity through diverse applications, from developing targeted disease therapies and enhancing cognitive capabilities to improving AI systems and addressing fundamental challenges in AI safety. Mechanistic interpretability is our compass on this journey, guiding us toward a deeper, more integrated understanding of intelligence in all its forms.