Jai Bhagat -- December 2024 (Draft)
The history of AI is intertwined with neuroscience. Artificial neural networks
(ANNs), which serve as the foundation of ubiquitous contemporary generative AI systems
like ChatGPT and Midjourney, were inspired by biological neuronal networks, hence the
name. Over time, however, neuroscience's influence on AI has diminished, as progress
in AI has advanced mainly as a result of:
Many in the field of neuroAI still believe in the potential of neuroscience to influence AI. Neuromorphic hardware is one area where this could happen, especially considering that the human brain runs on less than 5% of the power consumed by a contemporary high-end consumer-grade GPU. This post, however, will not focus on how neuroscience might still influence AI, but rather the reverse: how AI can influence neuroscience.
Note: in this section, "AI" refers specifically to supervised or semi-supervised
training of deep artificial neural networks, i.e. deep learning.
Early applications of AI in systems neuroscience focused primarily on regression and
classification problems: for instance, decoding neural activity patterns to predict
features of the environment or the organism's behavior, or classifying behavioral or
environmental states from sensor data such as video or audio. Although these approaches
improved our predictive power, they largely treated biological neural circuits as black
boxes. As deep learning developed throughout the 2010s, more sophisticated tools emerged.
In particular, convolutional ANNs revolutionized behavior tracking and pose estimation,
while recurrent ANNs provided new ways to model temporal dynamics in neural circuits. One
pivotal development came with the introduction of latent variable models, which allowed us
to reliably discover low-dimensional structure in high-dimensional neural activity.
These models, while powerful, often learn representations that are difficult to
interpret biologically: “there remains a gap in our understanding of how these latent
spaces map to neural-level computations” [1].
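To make the idea concrete, here is a minimal sketch, on purely synthetic data, of what recovering low-dimensional structure from high-dimensional activity can look like; PCA is used only as a stand-in for the more sophisticated latent variable models used in practice.

    # Synthetic example: a few latent signals drive many "neurons"; PCA then
    # reveals that a handful of components capture the shared structure. PCA
    # stands in here for richer latent variable models (e.g. GPFA, LFADS).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_neurons, n_timepoints = 200, 1000

    # Three smooth latent signals, mixed into firing rates for every neuron.
    t = np.linspace(0, 10, n_timepoints)
    latents = np.stack([np.sin(t), np.cos(2 * t), np.sin(3 * t)], axis=1)
    mixing = rng.normal(scale=0.5, size=(3, n_neurons))
    rates = np.exp(latents @ mixing)           # positive firing rates
    spikes = rng.poisson(rates).astype(float)  # Poisson spike counts

    pca = PCA(n_components=10).fit(spikes)
    print(np.round(pca.explained_variance_ratio_[:5], 3))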
AI researchers have faced similar challenges in understanding the inner workings of their
ANN models. Understanding the mechanisms by which a model makes decisions is crucial for
debugging and improving it, i.e. for developing AI safely and advancing AI capabilities.
Early AI interpretability work focused primarily on visualization techniques for
understanding what "visual features" a convolutional ANN learns when trained to classify
objects in images. However, a key shift occurred in the early 2020s with the emergence
of the field of mechanistic interpretability, which sought to move beyond simply
visualizing learned features or attributing importance to inputs, and instead
aimed to reverse engineer the underlying computational mechanisms in ANNs.
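For a concrete (if simplified) picture of that earlier visualization work, here is a sketch of activation maximization, where an input image is optimized to excite a chosen convolutional channel; the model, layer, channel, and hyperparameters are illustrative choices rather than a particular published recipe.

    # Minimal sketch of activation maximization: optimize a noise image so that
    # it maximally excites one channel of an early convolutional layer.
    import torch
    import torchvision.models as models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    for p in model.parameters():
        p.requires_grad_(False)

    activations = {}
    model.layer1[0].conv1.register_forward_hook(
        lambda module, inputs, output: activations.update(feat=output)
    )

    img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
    optimizer = torch.optim.Adam([img], lr=0.05)

    for _ in range(200):
        optimizer.zero_grad()
        model(img)
        loss = -activations["feat"][0, 0].mean()  # maximize channel 0's activation
        loss.backward()
        optimizer.step()
    # `img` now roughly depicts the visual pattern this channel responds to.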
Mechanistic interpretability emerged as AI researchers sought to understand how large language models (LLMs) implement specific capabilities. Key techniques include activation pattern analysis, circuit discovery, and sparse autoencoder (SAE) feature extraction. The field's core insight is that ANNs, despite their complexity, implement computations through interpretable features and circuits that we can systematically identify and understand. This mirrors a fundamental assumption in systems neuroscience: that neural circuits implement specific computations through identifiable mechanisms. Therefore, I believe that the tools developed to understand artificial neural networks can provide new ways to understand biological ones.
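To ground the SAE idea, here is a minimal sketch of a sparse autoencoder of the kind used in this line of work: an overcomplete hidden layer trained to reconstruct activation vectors under an L1 sparsity penalty. The dimensions, penalty weight, and random stand-in activations are illustrative, not drawn from any specific model.

    # Minimal sketch of a sparse autoencoder (SAE).
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_input: int, d_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_input, d_features)
            self.decoder = nn.Linear(d_features, d_input)

        def forward(self, x):
            features = torch.relu(self.encoder(x))  # sparse, non-negative features
            return self.decoder(features), features

    sae = SparseAutoencoder(d_input=512, d_features=4096)  # overcomplete dictionary
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
    l1_weight = 1e-3

    activations = torch.randn(1024, 512)  # stand-in for model (or neural) activity
    for _ in range(100):
        recon, features = sae(activations)
        loss = ((recon - activations) ** 2).mean() + l1_weight * features.abs().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()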
My recent work applies SAEs to neural activity recorded in mouse visual cortex while mice are presented with a variety of visual stimuli. My analysis of SAEs trained on neural activity closely parallels the analysis done on SAEs trained on the activations of transformer linear layers, where learned SAE features have been found to correspond to particular linguistic features. My approach here involved several key steps:
Interestingly, I found that particular SAE features correspond to particular visual stimulus properties, like full-field flashes! This work shows that SAEs can learn to reconstruct neural spike activity and compress it into sparse, interpretable features that can be traced back to the source spiking activity. Crucially, we can validate these features by examining their relationship to known properties of visual processing. This suggests the SAE is discovering genuine computational features rather than arbitrary representations, and lends promise to the idea that we can use SAEs to find novel neural representations!
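To illustrate (not reproduce) this validation step, here is a hypothetical sketch that asks whether individual SAE features respond selectively to particular stimulus types; the arrays and stimulus names below are placeholders rather than results from the actual analysis.

    # Hypothetical sketch: which SAE features are selective for which stimuli?
    # Feature activations and stimulus labels are random placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_features = 5000, 4096
    features = np.abs(rng.normal(size=(n_samples, n_features)))  # SAE feature activations
    stimulus_labels = rng.choice(
        ["full_field_flash", "drifting_grating", "natural_scene"], size=n_samples
    )

    for stim in np.unique(stimulus_labels):
        in_stim = stimulus_labels == stim
        # Selectivity: mean activation during this stimulus minus mean elsewhere.
        selectivity = features[in_stim].mean(axis=0) - features[~in_stim].mean(axis=0)
        top = np.argsort(selectivity)[-3:][::-1]
        print(f"{stim}: most selective SAE features -> {top}")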
As we stand on the cusp of a new era in neuroAI, the potential for mechanistic interpretability to advance our understanding of biological nervous systems is immense. By continuing to refine and expand our interpretability toolbox, we can unlock new insights into both biological and artificial intelligence. Ultimately, the goal is not just to better understand intelligent systems, but to leverage this knowledge to advance humanity through diverse applications, from developing targeted disease therapies and enhancing cognitive capabilities to improving AI systems and addressing fundamental challenges in AI safety. Mechanistic interpretability is our compass on this journey, guiding us toward a deeper, more integrated understanding of intelligence in all its forms.