Introduction
Understanding how cells respond to genetic perturbations is essential to designing effective therapies and accelerating drug development. Yet the traditional approach, screening large numbers of perturbations experimentally across extensive biological contexts (such as different cell types), is expensive and, for all combinations of interest, simply intractable. This often involves a degree of “blind optimization,” where interventions are tested somewhat speculatively without a deep understanding of the underlying cellular machinery, contributing to high failure rates for candidates in late stages of the drug-discovery pipeline.
This bottleneck highlights the critical need for ML models capable of learning the underlying rules of cellular behaviour. At Valence Labs, Recursion’s AI research engine, we believe progress hinges on developing models that can accurately predict cellular responses, especially for novel interventions or in new biological contexts. Out-of-distribution (OOD) generalization to new perturbations and cellular contexts is an essential part of the Predict pillar of an effective virtual cell—a vision focused on building models that predict, explain, and help discover biological phenomena.
See our recent perspective paper for more on our vision.
As an important step in addressing this prediction challenge, we introduce TxPert: a state-of-the-art model that leverages multiple biological knowledge networks to accurately predict transcriptional responses under OOD scenarios.
TxPert: A Unifying Framework for Transcriptomic Perturbation Prediction
TxPert moves beyond models trained on single datasets, offering a unifying approach that:
- Learns from diverse datasets: It is trained on a broad collection of single-cell perturbation datasets, spanning various experimental techniques and cellular systems.
- Excels at multiple OOD tasks: TxPert is specifically engineered to generalize across:
  - Unseen single-gene perturbations within known cell types.
  - Novel combinations of gene perturbations (e.g., double-gene perturbations).
  - Known perturbations in entirely new cell types not seen during training.
- Leverages biological inductive biases: A cornerstone of TxPert is its integration of multiple biological knowledge graphs. These range from curated public resources like STRINGdb (protein-protein interactions) and Gene Ontology (GO) to unique, large-scale relationship maps derived from Recursion’s phenomics (PxMap) and transcriptomics (TxMap) perturbation screens. This imbues the model with a structured understanding of biological relationships.
Across multiple human cell lines (K562, RPE1, HEPG2, Jurkat), TxPert consistently outperforms existing methods like GEARS and scLAMBDA in predicting the impact of targeting unseen genes. For example, its predictive accuracy for unseen single perturbations often approaches experimental reproducibility levels. For novel double-gene perturbations, TxPert surpasses standard additive baselines, indicating a capacity to model more complex, synergistic interactions. And critically, it demonstrates effective generalization when predicting known perturbation effects in new cell types.
How TxPert Works: Integrating AI with Structured Biological Knowledge
TxPert’s ability to generalize stems from its latent transfer paradigm, where it learns to represent both the cell’s initial state and the impact of perturbations in a shared, structured space before predicting the outcome. This is achieved through two key modules:
- Basal State Encoder: This module captures the cell’s intrinsic state (cell type, experimental batch conditions, etc.) from its pre-perturbation gene expression profile, creating a concise embedding.
- Perturbation Encoder: This module leverages Graph Neural Networks (GNNs) to learn informative representations of genetic perturbations. Instead of treating genes as isolated entities, the GNN considers their connectivity within the integrated biological knowledge graphs. GNN architectures explored within TxPert, such as Exphormer-MG and GAT-MultiLayer, are designed to effectively fuse information from multiple, complementary graph sources, allowing the model to learn which biological relationships are most salient for predicting the effect of a given perturbation.
By combining these learned representations, TxPert predicts the post-perturbation gene expression profile. Our analyses strongly confirm that the quality, accuracy, and integration of multiple diverse knowledge graphs are crucial for enhancing predictive performance—the more structured biological context the model has, the better it generalizes. For example, TxPert not only predicts local effects on graph neighbours of a perturbed gene but also captures broader, transcriptome-wide functional changes consistent with known biology (e.g., for ribosome maturation factor TSR2).
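To make the latent-transfer pattern concrete, below is a minimal PyTorch sketch, assuming torch_geometric is available. It is an illustration rather than TxPert's actual implementation: the module names, dimensions, single-layer GATConv perturbation encoder, and summing of embeddings for multi-gene perturbations are all our own simplifications.

```python
# Illustrative sketch of the latent-transfer idea -- not TxPert's actual code.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv


class BasalEncoder(nn.Module):
    """Embed a cell's pre-perturbation expression profile (basal state)."""

    def __init__(self, n_genes: int, d: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_genes, 512), nn.ReLU(), nn.Linear(512, d))

    def forward(self, expression: torch.Tensor) -> torch.Tensor:
        return self.net(expression)


class PerturbationEncoder(nn.Module):
    """Embed perturbed genes via message passing over a knowledge graph."""

    def __init__(self, n_genes: int, d: int = 128):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d)
        self.gnn = GATConv(d, d)  # one attention layer; TxPert explores deeper, multi-graph variants

    def forward(self, pert_idx: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = self.gnn(self.gene_emb.weight, edge_index)  # propagate over the knowledge graph
        return h[pert_idx].sum(dim=0, keepdim=True)     # sum embeddings for multi-gene perturbations


class LatentTransferModel(nn.Module):
    """Shift the basal embedding by the perturbation embedding, then decode."""

    def __init__(self, n_genes: int, d: int = 128):
        super().__init__()
        self.basal = BasalEncoder(n_genes, d)
        self.pert = PerturbationEncoder(n_genes, d)
        self.decoder = nn.Linear(d, n_genes)

    def forward(self, expression, pert_idx, edge_index):
        z = self.basal(expression) + self.pert(pert_idx, edge_index)
        return self.decoder(z)  # predicted post-perturbation expression profile
```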
Advancing the Field: Our Commitment to Benchmarking
A significant contribution of the TxPert project is the introduction of an expanded and rigorous benchmarking framework. The field has struggled with inconsistent evaluation standards, making fair comparison and iterative improvement difficult. Our framework incorporates best practices such as meticulous batch-matched control handling (critical due to significant batch effects in biological data) and evaluation metrics (like retrieval metrics alongside Pearson Δ) designed to specifically assess a model’s ability to capture perturbation-specific signals rather than just general cellular stress responses or mean effects. Further, we put our results in the context of both simple baselines and the soft target of experimental reproducibility.
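As an illustration of the Pearson Δ metric with batch-matched controls, here is a minimal sketch; the function name pearson_delta, the toy data, and the use of mean expression profiles are assumptions for clarity rather than the framework's actual code.

```python
# Minimal sketch of Pearson delta: correlate the *change* in expression
# (perturbed minus batch-matched control) between prediction and observation,
# so that shared batch/control structure does not inflate the score.
import numpy as np


def pearson_delta(pred: np.ndarray, obs: np.ndarray, control: np.ndarray) -> float:
    """Pearson correlation between predicted and observed expression deltas.

    pred, obs, control: mean expression vectors over genes, where `control`
    comes from unperturbed cells in the same experimental batch.
    """
    pred_delta = pred - control
    obs_delta = obs - control
    return float(np.corrcoef(pred_delta, obs_delta)[0, 1])


# Toy usage with random vectors standing in for mean expression profiles.
rng = np.random.default_rng(0)
control = rng.normal(size=2000)
obs = control + rng.normal(scale=0.1, size=2000)   # observed perturbation effect
pred = control + rng.normal(scale=0.1, size=2000)  # model prediction
print(f"Pearson delta: {pearson_delta(pred, obs, control):.3f}")
```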
Why This Matters: Towards a More Predictive Future for Drug Discovery
The ability to accurately simulate cellular responses to novel genetic perturbations, as demonstrated by TxPert, is a significant step. While TxPert focuses on transcriptomic changes and is an important contribution rather than a complete solution to cellular simulation, this offers a glimpse into a future where:
- Therapeutic hypotheses are prioritized more strategically, moving beyond exhaustive experimental screening towards more rational, simulation-guided design.
- Drug discovery cycles are accelerated by focusing wet-lab resources on the most promising interventions identified in silico.
- Our understanding of complex biological networks is deepened through models that learn from and integrate vast webs of biological knowledge.
TxPert provides a robust framework and establishes a new performance benchmark for future developments in predicting perturbation effects. At Valence Labs, we view this work as a foundational element in our broader mission to construct comprehensive virtual cells—dynamic, multi-modal models capable of predicting, explaining, and helping to discover cellular behavior at an unprecedented scale. This research is integral to the larger effort at Recursion of decoding biology to radically improve lives.
Introduction
The quest to decode biology is one of humanity’s most enduring and critical challenges. Traditional drug discovery, a painstaking process often spanning over a decade and costing billions, still faces a high rate of failure in clinical trials. We believe a fundamental shift is necessary to alter this paradigm—one that moves towards a more comprehensive, predictive, and mechanistic understanding of biology.
What if we could construct computational models that not only predict cellular responses to interventions but also explain why those responses occur? What if these “virtual cells” could help us navigate the immense complexity of biology to discover and develop new medicines more efficiently and effectively?
At Valence Labs, Recursion’s AI research engine, we are exploring this frontier. We believe the time is ripe to engineer the foundations for such virtual cells. In our recently released perspective paper, “Virtual Cells: Predict, Explain, Discover,” we share our thinking on developing and validating these computational systems.
Foundational Pillars at Recursion Enable the Pursuit of the Virtual Cell
Realizing this vision is possible due to unique capabilities within Recursion:
- Industrial-Scale Interventional Data: Building and validating virtual cells requires biological data at an unprecedented scale and diversity. Recursion has dedicated over a decade to creating one of the world’s largest and most relatable proprietary biological and chemical datasets, now exceeding 65 petabytes. This dynamic and interventional data spans phenomics, transcriptomics, proteomics, invivomics, patient-level data and more. With up to 2.2 million experiments executed weekly via Recursion’s automated labs, these datasets continue to expand.
- Pioneering AI and Machine Learning for Biology: Modern AI/ML techniques power virtual cells. We’re focused on developing and applying ML approaches tailored to the complexity of biology and drug discovery, building on a strong track record that includes foundational research, open-source models, and real-world applications.
- Purpose-Built Supercomputing Infrastructure: The scale of biological data and complexity of AI models necessitate significant computational capacity. Recursion’s BioHive-2, the most powerful supercomputer in biopharma, provides us with the essential infrastructure to train industry-leading models and explore the frontiers of biological simulation.
- Synergy of Automation and AI: Tight integration between Recursion’s advanced lab automation and Valence Labs’ AI model development creates powerful feedback loops, enabling rapid cycles of hypothesis generation, experimental validation, and model refinement.
Laying the Groundwork: TxPert as an Early Step in Our Blueprint
With Recursion, we are taking concrete, iterative steps to build upon our foundations on our path to virtual cells. Today, alongside our perspective paper, we are introducing TxPert, a state-of-the-art model for predicting transcriptional responses to combinatorial genetic perturbations. TxPert is designed to perform robustly in out-of-distribution (OOD) scenarios, such as predicting the effects of unseen gene knockdowns, combinations of perturbations, or responses in new cellular contexts.
Accurately predicting these responses is essential for prioritizing hypotheses in drug discovery, where experimental testing of all possibilities is neither scalable nor economical. TxPert achieves strong performance by leveraging biological knowledge networks and graph-based architectures, enabling it to capture context-specific effects and generalize beyond the training distribution.
TxPert tackles the “Predict” capability within our framework. It represents an initial step in our journey towards the virtual cell, and the benchmarking framework it introduces exemplifies our commitment to building models informed by biological complexity and guided by real-world drug discovery outcomes.
The Engine of Discovery: Agentic Systems
The virtual cell is a dynamic, continuously learning system. As detailed in our paper, we envision a “lab-in-the-loop” paradigm where virtual cells are iteratively refined. Conceptually, a virtual cell can be considered a theory of human cellular physiology and pathology that generates hypotheses about biological systems, observable through experiments.
Agents can then use these hypotheses to design and run experiments in the lab, explicitly seeking to falsify the virtual cell’s current “theory”. An experiment that proves a prediction wrong becomes a crucial learning opportunity. This helps refine the model, making it more accurate and robust. By actively seeking experiments that falsify the current virtual cell, we can uncover novel biology and accelerate therapeutic discovery.
This cycle – predict, test (falsify), refine – creates a powerful discovery engine. In the context of drug discovery, it represents a shift from the slower “design-make-test-measure” paradigm to a faster “design-simulate-test-learn” or “design-simulate” approach, paving the way for new therapeutic opportunities.
Exploring how agentic systems can accelerate this cycle is a key focus. We are actively developing systems capable of not only conditionally generating hypotheses using virtual cells but also prioritizing hypotheses for falsification, designing experiments to efficiently test those hypotheses, orchestrating the execution of experiments, analyzing the experimental outcomes, and integrating these into the virtual cell for iterative refinement. Recursion’s work on the LLM Orchestrated Workflow Engine (LOWE) illustrates this potential.
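To make the control flow concrete, here is a conceptual sketch of the loop; every class and function (VirtualCell, design_experiment, run_in_lab) is a hypothetical stand-in, not a Recursion or LOWE API.

```python
# Conceptual sketch of the predict-test (falsify)-refine loop. All names here
# are hypothetical placeholders illustrating the control flow only.
import random


class VirtualCell:
    """Toy stand-in for a virtual cell: a 'theory' that emits hypotheses."""

    def generate_hypotheses(self):
        return [f"hypothesis-{i}" for i in range(5)]

    def uncertainty(self, hypothesis) -> float:
        return random.random()  # placeholder for a real uncertainty estimate

    def update(self, hypothesis, outcome) -> None:
        pass  # placeholder: retrain / recalibrate on the new evidence


def design_experiment(hypothesis):
    return f"experiment testing {hypothesis}"


def run_in_lab(experiment) -> bool:
    return random.random() < 0.5  # placeholder for a wet-lab result


def lab_in_the_loop(cell: VirtualCell, budget: int) -> VirtualCell:
    for _ in range(budget):
        # Predict: the virtual cell proposes testable hypotheses.
        hypotheses = cell.generate_hypotheses()
        # Prioritize the hypothesis whose falsification is most informative.
        target = max(hypotheses, key=cell.uncertainty)
        # Test: design and execute the experiment that best probes it.
        outcome = run_in_lab(design_experiment(target))
        # Refine: a failed prediction localizes where the theory is wrong.
        cell.update(target, outcome)
    return cell


lab_in_the_loop(VirtualCell(), budget=3)
```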
Bridging Biological Scales: From Molecular Mechanisms to Cellular Function
A virtual cell must also bridge the gap between the molecular world and cellular phenomena. While simulating entire cells from first principles is not yet feasible, we are aiming to integrate across scales. This involves leveraging high-fidelity, bottom-up simulations for key molecular events and connecting these physics-informed models with functional data from Recursion’s large-scale cellular experiments.
Our research includes exploring advanced molecular modeling techniques, including areas related to understanding quantum-derived chemical properties (i.e. OpenQDC), to infuse our models with a deeper mechanistic understanding.
A Path Forward: Rigorous Benchmarking and Collaborative Progress
Tangible progress in the development of virtual cells hinges on establishing and adopting robust, biologically grounded benchmarks. Standardized evaluation is essential for ensuring new methods advance predictive and explanatory capabilities.
Initiatives that promote open benchmarking platforms, such as Polaris, and the sharing of high-quality datasets, like RxRx3-core, are vital steps in establishing these high standards and fostering transparent progress across the field.
Our ultimate goal is to contribute to a future where drug discovery is dramatically accelerated and the probability of success is significantly increased for therapies addressing critical patient needs. The path towards virtual cells and eventually virtual tissues, organs, and even patient-specific models is ambitious but charts a course towards a new era of precision medicine.
We are excited to share our perspective and ongoing work, and we invite the broader scientific community to engage with these ideas and to challenge and build upon them as we collectively work towards a new, more predictable, and powerful paradigm in the discovery of new medicines.
We encourage you to read our full perspective paper, “Virtual Cells: Predict, Explain, Discover,” to delve deeper into the vision guiding our research at Valence Labs.
Introduction
We curated and consolidated 40+ quantum mechanics (QM) datasets, covering 1.5 billion geometries across 70 atom species and 250+ QM methods, into a single, accessible hub called OpenQDC. It's open source, and the datasets are available through the OpenQDC Python library. Install it via pip (pip install openqdc) to start downloading and using various QM datasets in just a single line of code.
Github page: https://github.com/valence-labs/openQDC
Website: https://www.openqdc.io/

Challenges with QM Datasets
Developing robust machine learning interatomic potentials (MLIPs) requires vast amounts of QM data. Unfortunately, there is a lack of standardized, plug-and-play datasets that can be used to train and test new ML algorithms, hindering the prototyping of new research in this field.
Existing QM datasets span various methods and different chemical spaces. They're also scattered across several repositories (e.g., QCArchive, ColabFit, NablaDFT, GEOM), often with missing metadata (e.g., level of theory and units), adding an extra layer of complexity to working with these datasets. This not only hampers the adoption and utility of the data, but also stifles opportunities for collaboration among physicists, chemists, ML experts, and experts in other fields, limiting the progress of ML research.
The Open QDC Library
The OpenQDC Python library makes it easy to work with all of the quantum datasets in the hub. It’s a package that aims to provide a simple and efficient way to download, load, and utilize various datasets. You can download datasets with just one line of code.
- A simple pythonic API: The simplicity of the Python interface ensures ease of use, making it perfect for quick prototyping.
- ML-Ready: All you manipulate are torch.Tensor, jax.Array, or numpy.ndarray objects.
- Quantum ready: The quantum methods used by the datasets are checked and standardized to provide additional values, useful normalization, and different statistics.
- Standardized: The datasets are written in standard and performant formats with annotated metadata like units and labels.
- Performance matters: Read and write multiple formats (memmap, zarr, xyz, etc.).
- Data: Have access to 1.5+ billion data points.
- Open source & extensible: OpenQDC and all its files and datasets are open source, and you can add your own dataset and share it with the community in just a few minutes.
Getting Started
Install OpenQDC with pip or conda:
```bash
pip install openqdc
# or
conda install openqdc -c conda-forge
```
Now you are ready to download any of our QM datasets with the CLI:
```bash
openqdc download SpiceV2
```
Or using the Python API:
```python
from openqdc import SpiceV2

dataset = SpiceV2()
```
Below is a glimpse of how easy it is to use OpenQDC and how it interfaces with torch and torch_geometric:
```python
from torch.utils.data import DataLoader

from openqdc import MACEOFF

# The energy_unit value was truncated in the original post; "ev" is illustrative,
# and array_format is added here so samples come back as torch tensors.
dataset = MACEOFF(energy_unit="ev", array_format="torch")
dataloader = DataLoader(dataset, batch_size=32)
# ...
```
Because OpenQDC is framework agnostic, it also works smoothly with torch_geometric; here we use the radius_graph function from torch_cluster to build a graph:
```python
from torch_cluster import radius_graph
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

from openqdc import SpiceV2

# The function signature was truncated in the original post and is reconstructed here.
def to_pyg_data(x):
    # Build a radius graph (5 Å cutoff) over the atomic positions.
    edge_index = radius_graph(x.positions, 5)
    return Data(edge_index=edge_index, **x)

ds = SpiceV2(array_format="torch", distance_unit="ang", transform=to_pyg_data)
loader = DataLoader(ds, batch_size=32, shuffle=True)
# ...
```
We hope OpenQDC can be a great resource for the community to advance MLIP research towards a future of training universal potentials with greater generalizability and robustness.
Please feel free to share your feedback or connect with the Valence Labs team on GitHub, X, LinkedIn, or Valence Portal!

On the Scalability of GNNs for Molecular Graphs
Scaling deep learning models has been at the heart of recent revolutions in language modeling and image generation. Practitioners have observed a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) are yet to show the benefits of scale mainly due to the lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures.
We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs. For the first time, we observe that GNNs benefit tremendously from increasing width, number of molecules, number of labels, and diversity in the pretraining datasets.
We’re excited to introduce MolGPS, a 3B parameter model for various molecular property prediction tasks. We hope that this work will pave the way for an era where foundational GNNs drive pharmaceutical drug discovery. Not only does the model’s performance scale with parameters, but it also benefits tremendously from integrating high-level phenomics data into the mix.
Scaling Experiments

In the following experiments, we look at the performance of MolGPS while increasing the width such that the parameter count goes up from 1M to 3B parameters. To properly assess the benefits of scale, we evaluate the performance of the model when probing the 22 downstream tasks from TDC. Here, the hyper-parameter search is done with 10M parameters, and width scaling is done in a zero-shot setting using the muTransfer technique.
We observe that GNNs benefit tremendously from increasing the model's width, with gains that are consistent and scale linearly with the logarithm of the parameter count. Indeed, there appears to be no slowdown in the scaling curve, hinting that we could continue to improve performance with larger and larger models, mirroring trends observed for large language models (LLMs).
Further, we note that MolGPS shows significant improvement over the TDC baselines, i.e., the best model performance per task reported when TDC was first introduced in 2021. Note that on the y-axis, a value of 0 represents the average of all submissions to the TDC benchmark. Compared to the latest SOTA on TDC, our ensemble model passes the line of the best model per task, meaning that it is generally better to simply use MolGPS than to try all 30+ methods that are part of the TDC benchmark.
We also note that the model reaches a limit when scaling on public data only, but the addition of private phenomics data pushes the boundary of scale and performance much further.
Above, we report the “normalized performance” representing the average of the z-score across the 22 tasks from TDC. The z-score is computed based on the model’s performance relative to the leaderboard for a task, adjusted for the polarity of the task metric, i.e., multiplied by -1 if “lower is better”.
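For concreteness, a minimal sketch of this normalized-performance computation might look as follows; the function name, toy leaderboards, and task metrics are illustrative assumptions, not the actual evaluation code.

```python
# Sketch of the normalized-performance metric: z-score a model's result
# against the leaderboard for each task, flip the sign for "lower is better"
# metrics, and average across tasks. A value of 0 corresponds to the
# leaderboard average.
import numpy as np


def normalized_performance(model_scores, leaderboards, lower_is_better):
    """Average polarity-adjusted z-score across tasks.

    model_scores: per-task score of the model being evaluated.
    leaderboards: one array per task of prior submissions' scores.
    lower_is_better: per-task flags for metric polarity.
    """
    z_scores = []
    for score, board, lower in zip(model_scores, leaderboards, lower_is_better):
        z = (score - np.mean(board)) / np.std(board)
        z_scores.append(-z if lower else z)  # flip so higher is always better
    return float(np.mean(z_scores))


# Toy example: one AUROC-like task (higher better), one MAE-like task (lower better).
boards = [np.array([0.70, 0.75, 0.80]), np.array([0.50, 0.40, 0.30])]
print(normalized_performance([0.82, 0.25], boards, [False, True]))
```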
Molecules vs. language
In the context of LLMs, our scaling experiment results may seem surprising given our models are only trained on a few million data points (molecules) while LLMs are typically trained on larger datasets comprising trillions of tokens. To better understand the performance increase and gap in dataset size, it’s helpful to draw a few analogies between molecules and language.
In our setting, molecules are analogous to sentences in language processing, while the atoms and bonds are analogous to tokens. Moreover, the task is supervised, and some molecules have thousands of associated labels coming from experimental data. This allows the learned molecular embeddings to be much richer than those derived from simply predicting a missing token.

LOWE
LOWE—our LLM-orchestrated Workflow Engine—is an LLM agent that represents the next evolution of the Recursion OS. LOWE supports drug discovery programs by orchestrating complex workflows. These workflows chain together a variety of steps and tools, from finding significant relationships within Recursion’s Maps of Biology and Chemistry to generating novel compounds and scheduling them for synthesis and experimentation. Through its natural language interface and interactive graphics, LOWE puts state-of-the-art AI tools into the hands of every drug discovery scientist at Recursion in a simple and scalable way.
Integrating with the Recursion OS
At the forefront of LOWE’s functionality is its ability to integrate with the Recursion OS. This includes access to petabytes of proprietary data and specialized computational tools tailored for drug discovery.
This includes the ability to navigate and assess relationships within Recursion’s proprietary PhenoMap data, use MatchMaker to identify drug-target interactions, and deploy deep learning-based generative chemistry methods. This integration enables LOWE to perform critical, multi-step tasks in drug discovery such as identifying new therapeutic targets, designing novel compounds and libraries, and predicting ADMET properties. Additionally, LOWE streamlines the process of procuring commercial compounds, enhancing the operational efficiency of R&D projects.
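As a rough illustration of this orchestration pattern (and only the pattern; none of the names below are actual LOWE or Recursion OS APIs), a toy planner-and-tools loop might look like this:

```python
# Purely hypothetical sketch: an LLM planner decomposes a natural-language
# request into tool calls and chains their outputs. All classes, functions,
# and tool names are illustrative stand-ins.

def handle_request(llm, tools: dict, user_request: str):
    plan = llm.plan(user_request, available_tools=list(tools))
    context = {}
    for step in plan:
        # Each tool consumes results accumulated so far and adds its own.
        context[step] = tools[step](context)
    return llm.summarize(user_request, context)


class StubLLM:
    """Toy planner standing in for the orchestrating language model."""

    def plan(self, request, available_tools):
        return available_tools  # trivially run every tool in order

    def summarize(self, request, context):
        return f"Completed {len(context)} steps for: {request}"


tools = {
    "find_targets": lambda ctx: ["TARGET_A"],       # stand-in for target identification
    "generate_compounds": lambda ctx: ["CMPD-1"],   # stand-in for generative chemistry
    "predict_admet": lambda ctx: {"CMPD-1": "ok"},  # stand-in for ADMET prediction
}
print(handle_request(StubLLM(), tools, "Find inhibitors for TARGET_A"))
```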
The future of TechBio
The growing number of AI tools being developed and datasets being generated at Recursion increases the complexity of early drug discovery workflows. New systems are required to streamline this complexity in order to maximize the full potential of the Recursion OS. This is the true power of LOWE.
LOWE’s ability to streamline complex workflows, integrate with specialized tools, and make cutting-edge AI accessible to all scientists marks a significant advancement in how drug discovery projects can be run. LOWE demonstrates the power of LLM-based workflow engines in enhancing efficiency, fostering innovation, and driving forward the discovery of new and effective medicines. We are excited to see how LOWE will impact the future of drug discovery and the potential breakthroughs it will enable.
Ask LOWE to:
- Compute ADMET properties
- Filter compounds
- Give a list of known targets
- Order compounds
- Schedule compounds for experiments
- Use MatchMaker
- Use Recursion’s maps
- Expand the chemical space
- List compounds by solubility
- Generate compounds given a target
