Research

My research asks when learned representations provably recover semantically meaningful structure, and what the practical payoff is for interpretability and control. Below are brief teasers of the conceptual space around some of my recent and upcoming work.

Causality and Identifiability for Generalisable Interpretability and Control

Pretrained models are powerful but difficult to adapt minimally to downstream tasks or to steer for counterfactual control. Identifiable representation learning provides a potentially useful mathematical foundation for both: by guaranteeing uniqueness of latent variables under appropriate assumptions, it specifies exactly what structure a representation contains, what can be intervened on independently, and what cannot.

Interpretability is thus the first application: if the recovered variables are provably unique, causal claims about circuits and features become testable. My first project develops the Sparse Shift Autoencoder (SSAE), which recovers identifiable concept vectors from LLM representations by imposing sparsity not on the concepts themselves, but on the distribution of concept shifts across paired observations. This casts concepts naturally as factors of variation in observed data and requires no privileged pairs: any two prompts sampled uniformly suffice. We validate SSAEs on multiple LLMs and use constrained optimisation to make sparse optimisation stable across scale and choice of data.
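As a rough illustration, the sketch below shows the core idea under simplifying assumptions: a linear encoder and decoder act on the difference between two paired LLM hidden states, and a sparsity term is applied to the encoded shift rather than to the codes themselves. Class and function names, the plain L1 term, and the fixed penalty weight are illustrative stand-ins, not the paper's implementation (which uses the constrained formulation discussed below).

```python
import torch
import torch.nn as nn

class SparseShiftAutoencoder(nn.Module):
    """Minimal sketch: sparsity is imposed on the shift in concept space between
    two paired observations, not on the concept codes themselves."""

    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_concepts, bias=False)
        self.decoder = nn.Linear(n_concepts, d_model, bias=False)

    def forward(self, h1: torch.Tensor, h2: torch.Tensor):
        shift = self.encoder(h2 - h1)   # encoded concept shift, (batch, n_concepts)
        recon = self.decoder(shift)     # reconstructed representation shift
        return shift, recon


def ssae_loss(model, h1, h2, sparsity_weight=1e-3):
    """Any two prompts can be paired, so only a few concepts are expected to
    differ between them; the L1 term encourages exactly that."""
    shift, recon = model(h1, h2)
    recon_loss = ((recon - (h2 - h1)) ** 2).mean()
    return recon_loss + sparsity_weight * shift.abs().mean()
```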

Two co-authored studies probe where existing tools break. The first builds a multi-factor benchmark to study whether concepts recovered by sparse autoencoders are isolated in single dimensions (they are not!). The second separates the linear representation hypothesis from linear separability and shows that amortised unsupervised featurisers degrade systematically under distribution shift. Both failures are structural, and both point to the need for evaluation methods that account for concept geometry and identifiability rather than assuming them.

Optimisation as a Bottleneck for Operationalising Identifiability

Achieving identifiability in practice in unsupervised representation learning is also constrained by optimisation bottlenecks. If the quality of a representation is measured by its degree of identifiability, rather than by predictive loss as in supervised learning, then identifiability error needs to be studied in the same way generalisation error is, while still leveraging the benefits of over-parameterisation, which remains uncommon in identifiable representation learning. In a limited set of experiments, I have observed that constrained optimisation helps by ensuring stable hyperparameter transfer. The open problem is characterising how parameterisation, architecture, and optimiser bias jointly determine the reachability of identifiable solutions, and deriving scaling laws for identifiability error analogous to those for predictive loss.
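As a hedged sketch of what "constrained optimisation" means here, the snippet below continues the SSAE example above (it assumes `model` is a SparseShiftAutoencoder and `loader` yields paired representations) and replaces the fixed sparsity weight with an explicit sparsity budget `epsilon`, enforced by a Lagrange multiplier updated via dual ascent. The budget, learning rates, and variable names are illustrative assumptions; the point is that a constraint level tends to transfer across scales more readily than a hand-tuned penalty weight.

```python
import torch

# Assumes `model` is a SparseShiftAutoencoder (sketch above) and `loader` yields
# paired LLM representations (h1, h2). Constraint: the average L1 norm of concept
# shifts stays below a budget `epsilon`, enforced by a Lagrange multiplier.
epsilon = 0.05                                   # illustrative sparsity budget
log_lam = torch.zeros(1, requires_grad=True)     # dual variable, kept positive via exp()

primal_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
dual_opt = torch.optim.Adam([log_lam], lr=1e-2)

for h1, h2 in loader:
    shift, recon = model(h1, h2)
    recon_loss = ((recon - (h2 - h1)) ** 2).mean()
    constraint = shift.abs().mean() - epsilon    # <= 0 once the budget is met

    lagrangian = recon_loss + log_lam.exp() * constraint
    primal_opt.zero_grad()
    dual_opt.zero_grad()
    lagrangian.backward()
    primal_opt.step()                            # descent on the encoder/decoder
    log_lam.grad.neg_()                          # flip the sign: ascent on the multiplier
    dual_opt.step()
```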

Evaluating Identifiability on Real-World Data through Causal Hypothesis Testing

Widely used evaluation metrics for assessing identifiability conflate properties of the data distribution with properties of the encoding function, and therefore fail to assess the structural quality of a representation. The core difficulty is that these metrics rest on restrictive assumptions about what a good representation looks like. In practice, useful representations may distribute a single factor across multiple latent dimensions. They also fail when the ground-truth factors are themselves strongly correlated, precisely the regime where theory still guarantees recoverability but empirical evaluation becomes misleading. Worse, in realistic domains where ground truth is unavailable (such as in interpretability research), these metrics implicitly assume the learned representation must match a fixed "true" latent dimensionality, even though this quantity is neither observable nor well-defined. Models that benefit from over-parameterisation are penalised simply because they violate assumptions hard-coded into the metric. My research reframes evaluation through testable implications aligned with the hierarchy of questions a representation should support. The approach is to construct controlled task families that assess what the representation knows, without requiring access to unobserved ground truth.
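The toy example below illustrates one such hard-coded assumption: a mean-correlation-coefficient-style score with one-to-one (Hungarian) matching penalises a representation that spreads a ground-truth factor across two latent dimensions, even though the factor remains perfectly linearly decodable from them. The synthetic data and noise scale are purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=(n, 2))                          # two ground-truth factors

# Over-parameterised representation: factor 0 is spread across two latent
# dimensions (their mean recovers it exactly); factor 1 sits in a single one.
noise = rng.normal(scale=3.0, size=n)
latents = np.stack([z[:, 0] + noise, z[:, 0] - noise, z[:, 1]], axis=1)

# One-to-one matching credits each factor to at most a single latent dimension.
corr = np.abs(np.corrcoef(z.T, latents.T)[:2, 2:])   # factors x latents
rows, cols = linear_sum_assignment(-corr)
print("one-to-one matching score:", corr[rows, cols].mean())        # ~0.66: penalised

# Yet factor 0 is perfectly linearly decodable from the representation.
decoded = latents[:, :2].mean(axis=1)
print("corr(factor 0, decoded):", np.corrcoef(z[:, 0], decoded)[0, 1])  # ~1.0
```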

Applications: Jailbreak Mechanisms and Multi-Agent Cooperation

Identifiability determines whether you can reliably steer a model. I apply this framework to two concrete problems.

Jailbreak mechanisms. LLMs remain vulnerable to jailbreaks, but little is known about the internal mechanisms that differentiate safe from adversarial behaviour. The key observation is that while jailbreaks are mediated by dense perturbations in input text space, the corresponding changes in latent space should be relatively sparse, making them amenable to identifiable decomposition (such as via SSAEs). The open question is where such safety-relevant information is stored in the network, possibly in a non-localised form, and whether it can be reliably controlled at the appropriate level of abstraction.

Multi-agent cooperation. When LLMs act as interacting agents in social dilemmas, do they internalise behavioural strategies such as cooperation, defection, and reciprocity? Do they encode identifiable behavioural vectors that generalise across model families and dialogue contexts, and can those vectors be manipulated to control behaviour in multi-agent settings?