TxPert: Predicting Cellular Responses to Unseen Genetic Perturbations

Introduction

Understanding how cells respond to genetic perturbations is essential to designing effective therapies and accelerating drug development. Yet the traditional approach of experimentally screening large numbers of perturbations across many biological contexts (such as different cell types) is expensive and, for the full set of combinations of interest, simply intractable. In practice, it often involves a degree of “blind optimization,” in which interventions are tested somewhat speculatively, without a deep understanding of the underlying cellular machinery. This contributes to high failure rates for candidates in the late stages of the drug-discovery pipeline.

This bottleneck highlights the critical need for ML models capable of learning the underlying rules of cellular behaviour. At Valence Labs, Recursion’s AI research engine, we believe progress hinges on developing models that can accurately predict cellular responses, especially for novel interventions or in new biological contexts. Out-of-distribution (OOD) generalization to new perturbations and cellular contexts is an essential part of the Predict pillar of an effective virtual cell—a vision focused on building models that predict, explain, and help discover biological phenomena.

See our recent perspective paper for more on our vision.

As an important step in addressing this prediction challenge, we introduce TxPert: a state-of-the-art model that leverages multiple biological knowledge networks to accurately predict transcriptional responses under OOD scenarios.

TxPert: A Unifying Framework for Transcriptomic Perturbation Prediction

TxPert moves beyond models trained on single datasets, offering a unifying approach that:

1. Learns from diverse datasets: It is trained on a broad collection of single-cell perturbation datasets, spanning various experimental techniques and cellular systems.
2. Excels at multiple OOD tasks: TxPert is specifically engineered to generalize across:
  - Unseen single-gene perturbations within known cell types.
  - Novel combinations of gene perturbations (e.g., double-gene perturbations).
  - Known perturbations in entirely new cell types not seen during training.
3. Leverages biological inductive biases: A cornerstone of TxPert is its integration of multiple biological knowledge graphs. These range from curated public resources like STRINGdb (protein-protein interactions) and Gene Ontology (GO) to unique, large-scale relationship maps derived from Recursion’s phenomics (PxMap) and transcriptomics (TxMap) perturbation screens. This imbues the model with a structured understanding of biological relationships.
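
To make the three OOD settings concrete, here is a minimal sketch of how each one could be expressed as a train/test split. This is an illustration only, not the splitting code used for TxPert: the record layout, the function names, and the placeholder gene names are all assumptions.

```python
import random

# Toy records: one entry per (cell type, perturbed genes) condition.
# An empty tuple marks unperturbed control cells. Gene names are placeholders.
CONDITIONS = [(), ("GENE_A",), ("GENE_B",), ("GENE_C",), ("GENE_D",), ("GENE_A", "GENE_B")]
RECORDS = [
    {"cell_type": cell_type, "genes": genes}
    for cell_type in ("K562", "RPE1", "HEPG2")
    for genes in CONDITIONS
]


def split_unseen_single_genes(records, test_fraction=0.25, seed=0):
    """Hold out whole genes, so a test gene is never perturbed in training."""
    singles = sorted({r["genes"][0] for r in records if len(r["genes"]) == 1})
    rng = random.Random(seed)
    rng.shuffle(singles)
    n_test = max(1, int(len(singles) * test_fraction))
    held_out = set(singles[:n_test])
    # Drop every training record that touches a held-out gene, including doubles.
    train = [r for r in records if not held_out & set(r["genes"])]
    test = [r for r in records if len(r["genes"]) == 1 and r["genes"][0] in held_out]
    return train, test, held_out


def split_unseen_combinations(records):
    """Train on controls and single perturbations; test on multi-gene ones."""
    train = [r for r in records if len(r["genes"]) < 2]
    test = [r for r in records if len(r["genes"]) >= 2]
    return train, test


def split_unseen_cell_type(records, held_out_cell_type):
    """Train on all other cell types; test on perturbed cells of the held-out one."""
    train = [r for r in records if r["cell_type"] != held_out_cell_type]
    test = [r for r in records
            if r["cell_type"] == held_out_cell_type and r["genes"]]
    return train, test
```

In this sketch the control cells of the held-out cell type land in neither set. Whether a model may see them at prediction time (for example, to encode the basal state) is a separate design decision that the sketch leaves open.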

Across multiple human cell lines (K562, RPE1, HEPG2, Jurkat), TxPert consistently outperforms existing methods like GEARS and scLAMBDA in predicting the impact of targeting unseen genes. For example, its predictive accuracy for unseen single perturbations often approaches experimental reproducibility levels. For novel double-gene perturbations, TxPert surpasses standard additive baselines, indicating a capacity to model more complex, synergistic interactions. And critically, it demonstrates effective generalization when predicting known perturbation effects in new cell types.
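
For readers unfamiliar with the additive baseline mentioned above, the sketch below shows one common way to define it: the predicted effect of a double perturbation is the sum of the two single-gene effects, each measured relative to control. The exact formulation used in the TxPert benchmark may differ, and all numbers here are made up for illustration.

```python
def additive_baseline(control, single_a, single_b):
    """Predict the double-perturbation profile as control + delta_A + delta_B."""
    return [c + (a - c) + (b - c) for c, a, b in zip(control, single_a, single_b)]


def interaction_residual(observed_double, control, single_a, single_b):
    """Observed minus additive expectation; non-zero values suggest interaction."""
    expected = additive_baseline(control, single_a, single_b)
    return [o - e for o, e in zip(observed_double, expected)]


# Made-up mean expression values for three genes, for illustration only.
control = [1.0, 2.0, 3.0]
single_a = [1.5, 2.0, 2.0]     # mean profile after perturbing gene A alone
single_b = [1.0, 3.0, 2.5]     # mean profile after perturbing gene B alone
observed_ab = [1.5, 3.5, 1.5]  # hypothetical measured profile for A+B

expected_ab = additive_baseline(control, single_a, single_b)  # [1.5, 3.0, 1.5]
residual = interaction_residual(observed_ab, control, single_a, single_b)  # [0.0, 0.5, 0.0]
```

A model only beats this baseline if it captures the part of the response that the residual represents, which is why the baseline is a useful reference point for double-gene perturbations.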

Explore TxPert

For researchers wishing to explore TxPert’s predictions, we have developed an interactive web application. This tool allows users to select cell types, define genetic perturbations, and visualize TxPert’s in silico predictions of gene expression changes for chosen observation genes. It provides a direct way to generate hypotheses and explore potential cellular responses. The code is also open source and available here.

Below is an example of TxPert in action. We selected HEPG2 (human liver cancer cells) as the cell type and knocked out the PSMB5 gene. PSMB5 matters because it codes for a key component of the proteasome, the complex responsible for breaking down many proteins inside cells. Problems with PSMB5 have been linked to various diseases, especially cancer and neurodegenerative disorders, which makes it an interesting gene to study. The app displays a network of genes known to interact with PSMB5, and the interaction score shown comes directly from the biological graphs TxPert was trained on. We can then select a few of these genes to see how their expression is predicted to change when PSMB5 is knocked out. Some genes show drastic changes, while others change only slightly.

Predictions like these can offer further insight into disease mechanisms and point to therapeutic opportunities worth exploring.


How TxPert Works: Integrating AI with Structured Biological Knowledge

TxPert’s ability to generalize stems from its latent transfer paradigm, where it learns to represent both the cell’s initial state and the impact of perturbations in a shared, structured space before predicting the outcome. This is achieved through two key modules:

1. Basal State Encoder: This module captures the cell’s intrinsic state (cell type, experimental batch conditions, etc.) from its pre-perturbation gene expression profile, creating a concise embedding.
2. Perturbation Encoder: This module leverages Graph Neural Networks (GNNs) to learn informative representations of genetic perturbations. Instead of treating genes as isolated entities, the GNN considers their connectivity within the integrated biological knowledge graphs. GNN architectures explored within TxPert, such as Exphormer-MG and GAT-MultiLayer, are designed to effectively fuse information from multiple, complementary graph sources, allowing the model to learn which biological relationships are most salient for predicting the effect of a given perturbation.

By combining these learned representations, TxPert predicts the post-perturbation gene expression profile. Our analyses confirm that the quality and accuracy of the knowledge graphs, and the integration of multiple diverse graphs, are crucial for predictive performance: the more structured biological context the model has, the better it generalizes. Beyond the graph structure itself, TxPert not only predicts local effects on graph neighbours of a perturbed gene but also captures broader, transcriptome-wide functional changes consistent with known biology (e.g., for the ribosome maturation factor TSR2).
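
The data flow described in this section can be sketched in a few lines of plain Python. This is a toy illustration under heavy assumptions, not TxPert’s architecture: the class name, the linear encoders and decoder, and the fixed equal graph weights are all hypothetical, and a single weighted neighbour-averaging step stands in for the Exphormer-MG or GAT-MultiLayer GNNs. The weights are random and untrained, so the outputs are meaningless; only the shape of the computation is the point.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyLatentTransfer:
    def __init__(self, genes, graphs, latent_dim=4, seed=0):
        rng = random.Random(seed)
        n = len(genes)
        self.graphs = graphs  # one {gene: {neighbour: edge_weight}} dict per graph
        # Assumption: equal, fixed graph weights; a real model would learn these.
        self.graph_weights = [1.0 / len(graphs)] * len(graphs)
        self.gene_emb = {g: [rng.gauss(0.0, 0.1) for _ in range(latent_dim)] for g in genes}
        self.w_basal = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(latent_dim)]
        self.w_decode = [[rng.gauss(0.0, 0.1) for _ in range(latent_dim)] for _ in range(n)]

    def encode_basal(self, control_expression):
        """Basal state encoder: project control expression into the latent space."""
        return matvec(self.w_basal, control_expression)

    def encode_perturbation(self, gene):
        """Perturbation encoder: the gene's embedding plus neighbour messages."""
        z = list(self.gene_emb[gene])
        for graph, graph_weight in zip(self.graphs, self.graph_weights):
            neighbours = graph.get(gene, {})
            total = sum(neighbours.values())
            if total == 0:
                continue  # the gene has no edges in this graph
            for neighbour, edge_weight in neighbours.items():
                scale = graph_weight * edge_weight / total
                for k, value in enumerate(self.gene_emb[neighbour]):
                    z[k] += scale * value
        return z

    def predict(self, control_expression, perturbed_genes):
        """Combine both embeddings in the shared space, then decode a shift."""
        z = self.encode_basal(control_expression)
        for gene in perturbed_genes:
            z = [a + b for a, b in zip(z, self.encode_perturbation(gene))]
        shift = matvec(self.w_decode, z)
        return [c + d for c, d in zip(control_expression, shift)]


# Placeholder genes and two made-up graphs standing in for different knowledge sources.
GENES = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
GRAPHS = [
    {"GENE_A": {"GENE_B": 0.9, "GENE_C": 0.4}, "GENE_B": {"GENE_A": 0.9}},
    {"GENE_A": {"GENE_D": 1.0}, "GENE_D": {"GENE_A": 1.0}},
]
control_expression = [1.0, 0.5, 2.0, 0.0]
model = ToyLatentTransfer(GENES, GRAPHS)
predicted = model.predict(control_expression, ["GENE_A"])
```

Because a perturbation’s representation is built from its graph neighbourhood, a gene that was never perturbed in training can still receive a meaningful embedding, which is the intuition behind using knowledge graphs for unseen perturbations.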

Advancing the Field: Our Commitment to Benchmarking

A significant contribution of the TxPert project is the introduction of an expanded and rigorous benchmarking framework. The field has struggled with inconsistent evaluation standards, making fair comparison and iterative improvement difficult. Our framework incorporates best practices such as careful handling of batch-matched controls (critical given the significant batch effects in biological data) and evaluation metrics, including retrieval metrics alongside Pearson Δ, that are designed to assess a model’s ability to capture perturbation-specific signals rather than just general cellular stress responses or mean effects. We also place our results in the context of both simple baselines and the soft target of experimental reproducibility.
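
The two kinds of metric named above can be illustrated with a short sketch. It assumes that Pearson Δ is the Pearson correlation between predicted and observed expression changes, each measured against the mean of controls from the same batch, and it uses a simple rank-based retrieval score in which a prediction scores well when it is closer to its own observed effect than to the effects of other perturbations. The exact definitions in the TxPert benchmark may differ, and the function names, perturbation labels, and expression values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def batch_matched_deltas(profiles, batch_of, control_means):
    """Subtract the control mean of each perturbation's own batch."""
    return {
        name: [e - c for e, c in zip(expr, control_means[batch_of[name]])]
        for name, expr in profiles.items()
    }


def pearson_delta(pred_deltas, obs_deltas):
    """Per-perturbation correlation between predicted and observed changes."""
    return {name: pearson(pred_deltas[name], obs_deltas[name]) for name in pred_deltas}


def retrieval_score(pred_deltas, obs_deltas):
    """Mean fraction of other perturbations ranked below the true match."""
    scores = []
    for name, pred in pred_deltas.items():
        sims = {other: pearson(pred, obs) for other, obs in obs_deltas.items()}
        beaten = sum(1 for other, s in sims.items() if other != name and s < sims[name])
        scores.append(beaten / (len(sims) - 1))
    return sum(scores) / len(scores)


# Made-up data: three perturbations measured in two batches, four genes each.
control_means = {"batch1": [1.0, 1.0, 1.0, 1.0], "batch2": [2.0, 2.0, 2.0, 2.0]}
batch_of = {"P1": "batch1", "P2": "batch1", "P3": "batch2"}
observed = {"P1": [2.0, 1.0, 0.5, 1.0], "P2": [1.0, 2.0, 1.0, 0.0], "P3": [1.5, 2.0, 3.0, 2.0]}
predicted = {"P1": [1.8, 1.1, 0.6, 1.0], "P2": [1.1, 1.7, 1.0, 0.2], "P3": [1.6, 2.0, 2.8, 2.1]}

obs_deltas = batch_matched_deltas(observed, batch_of, control_means)
pred_deltas = batch_matched_deltas(predicted, batch_of, control_means)
per_perturbation_r = pearson_delta(pred_deltas, obs_deltas)
retrieval = retrieval_score(pred_deltas, obs_deltas)
```

A model that predicts the same average shift for every perturbation can still reach a respectable Pearson Δ when most perturbations share a common response, but it cannot rank the correct perturbation first for every query. Reporting a retrieval metric alongside the correlation is one way to expose that difference.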

Why This Matters: Towards a More Predictive Future for Drug Discovery

The ability to accurately simulate cellular responses to novel genetic perturbations, as demonstrated by TxPert, is a significant step towards simulation-guided drug discovery. TxPert focuses on transcriptomic changes and is a contribution towards cellular simulation rather than a complete solution, but it offers a glimpse into a future where:

- Therapeutic hypotheses are prioritized more strategically, moving beyond exhaustive experimental screening towards more rational, simulation-guided design.
- Drug discovery cycles are accelerated by focusing wet-lab resources on the most promising interventions identified in silico.
- Our understanding of complex biological networks is deepened through models that learn from and integrate vast webs of biological knowledge.

TxPert provides a robust framework and establishes a new performance benchmark for future developments in predicting perturbation effects. At Valence Labs, we view this work as a foundational element in our broader mission to construct comprehensive virtual cells: dynamic, multi-modal models capable of predicting, explaining, and helping to discover cellular behaviour at an unprecedented scale. This research is integral to the larger effort at Recursion of decoding biology to radically improve lives.