Introduction

Understanding how cells respond to genetic perturbations is essential to designing effective therapies and accelerating drug development. Yet the traditional approach, screening large numbers of perturbations experimentally across extensive biological contexts (such as different cell types), is expensive and, for all combinations of interest, simply intractable. This often involves a degree of “blind optimization,” where interventions are tested somewhat speculatively without a deep understanding of the underlying cellular machinery, contributing to high failure rates for candidates in late stages of the drug-discovery pipeline.

This bottleneck highlights the critical need for ML models capable of learning the underlying rules of cellular behaviour. At Valence Labs, Recursion’s AI research engine, we believe progress hinges on developing models that can accurately predict cellular responses, especially for novel interventions or in new biological contexts. Out-of-distribution (OOD) generalization to new perturbations and cellular contexts is an essential part of the Predict pillar of an effective virtual cell—a vision focused on building models that predict, explain, and help discover biological phenomena.

See our recent perspective paper for more on our vision. 

As an important step in addressing this prediction challenge, we introduce TxPert: a state-of-the-art model that leverages multiple biological knowledge networks to accurately predict transcriptional responses under OOD scenarios.

TxPert: A Unifying Framework for Transcriptomic Perturbation Prediction

TxPert moves beyond models trained on single datasets, offering a unifying approach that:

  1. Learns from diverse datasets: It is trained on a broad collection of single-cell perturbation datasets, spanning various experimental techniques and cellular systems.

  2. Excels at multiple OOD tasks: TxPert is specifically engineered to generalize across:

    • Unseen single-gene perturbations within known cell types.
    • Novel combinations of gene perturbations (e.g., double-gene perturbations).
    • Known perturbations in entirely new cell types not seen during training.

  3. Leverages biological inductive biases: A cornerstone of TxPert is its integration of multiple biological knowledge graphs. These range from curated public resources like STRINGdb (protein-protein interactions) and Gene Ontology (GO) to unique, large-scale relationship maps derived from Recursion’s phenomics (PxMap) and transcriptomics (TxMap) perturbation screens. This imbues the model with a structured understanding of biological relationships.


Across multiple human cell lines (K562, RPE1, HEPG2, Jurkat), TxPert consistently outperforms existing methods like GEARS and scLAMBDA in predicting the impact of targeting unseen genes. For example, its predictive accuracy for unseen single perturbations often approaches experimental reproducibility levels. For novel double-gene perturbations, TxPert surpasses standard additive baselines, indicating a capacity to model more complex, synergistic interactions. And critically, it demonstrates effective generalization when predicting known perturbation effects in new cell types.

Explore TxPert

For researchers wishing to explore TxPert’s predictions, we have developed an interactive web application. This tool allows users to select cell types, define genetic perturbations, and visualize TxPert’s in silico predictions of gene expression changes for chosen observation genes. It provides a direct way to generate hypotheses and explore potential cellular responses. The code is also open source and available here.

Below is an example of TxPert in action. We selected HEPG2 (human liver cancer cells) as the cell type and knocked out the PSMB5 gene. PSMB5 is crucial because it codes for a key component of the proteasome, the complex responsible for breaking down many proteins inside cells. Problems with PSMB5 have been linked to various diseases, especially cancer and neurodegenerative disorders, making this an interesting gene to study. The app displays a network of genes known to interact with PSMB5; the displayed interaction score comes directly from the biological graphs TxPert was trained on. We can then select a few of these genes to see how their expression changes when PSMB5 is knocked out. Some genes change drastically, while others change only slightly.

This has the potential to give us further insights into disease mechanisms to explore as further therapeutic opportunities.

How TxPert Works: Integrating AI with Structured Biological Knowledge

TxPert’s ability to generalize stems from its latent transfer paradigm, where it learns to represent both the cell’s initial state and the impact of perturbations in a shared, structured space before predicting the outcome. This is achieved through two key modules:

  1. Basal State Encoder: This module captures the cell’s intrinsic state (cell type, experimental batch conditions, etc.) from its pre-perturbation gene expression profile, creating a concise embedding.

  2. Perturbation Encoder: This module leverages Graph Neural Networks (GNNs) to learn informative representations of genetic perturbations. Instead of treating genes as isolated entities, the GNN considers their connectivity within the integrated biological knowledge graphs. GNN architectures explored within TxPert, such as Exphormer-MG and GAT-MultiLayer, are designed to effectively fuse information from multiple, complementary graph sources, allowing the model to learn which biological relationships are most salient for predicting the effect of a given perturbation.

By combining these learned representations, TxPert predicts the post-perturbation gene expression profile. Our analyses strongly confirm that the quality, accuracy, and integration of multiple diverse knowledge graphs are crucial for enhancing predictive performance—the more structured biological context the model has, the better it generalizes. For example, TxPert not only predicts local effects on graph neighbours of a perturbed gene but also captures broader, transcriptome-wide functional changes consistent with known biology (e.g., for ribosome maturation factor TSR2).
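To make this two-module design more concrete, below is a minimal sketch of the latent-transfer idea in PyTorch, using torch_geometric’s GATConv as a stand-in for TxPert’s multi-graph encoders (e.g., Exphormer-MG, GAT-MultiLayer). All names and dimensions are illustrative assumptions, not the actual TxPert implementation.

Python

import torch
import torch.nn as nn
from torch_geometric.nn import GATConv


class LatentTransferModel(nn.Module):
    """Hypothetical sketch: basal-state encoder + graph-based perturbation encoder."""

    def __init__(self, n_genes: int, hidden: int = 128):
        super().__init__()
        # Basal state encoder: pre-perturbation expression -> cell-state embedding
        self.basal_encoder = nn.Sequential(
            nn.Linear(n_genes, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Perturbation encoder: learnable gene features refined by attention
        # over a biological knowledge graph (e.g., protein-protein interactions)
        self.gene_features = nn.Embedding(n_genes, hidden)
        self.gnn1 = GATConv(hidden, hidden, heads=4, concat=False)
        self.gnn2 = GATConv(hidden, hidden, heads=4, concat=False)
        # Decoder: combined embedding -> predicted post-perturbation expression
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_genes)
        )

    def forward(self, basal_expr, target_genes, edge_index):
        # basal_expr: (batch, n_genes); target_genes: list of gene-index tensors
        cell_state = self.basal_encoder(basal_expr)
        gene_emb = self.gnn1(self.gene_features.weight, edge_index)
        gene_emb = self.gnn2(torch.relu(gene_emb), edge_index)
        # Sum gene embeddings for combinatorial perturbations (e.g., double knockouts)
        pert = torch.stack([gene_emb[idx].sum(dim=0) for idx in target_genes])
        return self.decoder(torch.cat([cell_state, pert], dim=-1))

In practice, TxPert fuses several knowledge graphs and accounts for batch effects in the basal encoder; the sketch only conveys how a cell-state embedding and a graph-derived perturbation embedding are combined to predict the post-perturbation profile.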

Advancing the Field: Our Commitment to Benchmarking

A significant contribution of the TxPert project is the introduction of an expanded and rigorous benchmarking framework. The field has struggled with inconsistent evaluation standards, making fair comparison and iterative improvement difficult. Our framework incorporates best practices such as meticulous batch-matched control handling (critical due to significant batch effects in biological data) and evaluation metrics (like retrieval metrics alongside Pearson Δ) designed to specifically assess a model’s ability to capture perturbation-specific signals rather than just general cellular stress responses or mean effects. Further, we put our results in the context of both simple baselines and the soft target of experimental reproducibility.
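As an illustration of the metrics involved, here is a minimal sketch of a Pearson Δ computation, assuming mean expression vectors for the prediction, the observed perturbation, and its batch-matched control; it is not the exact evaluation code used in our benchmark.

Python

import numpy as np


def pearson_delta(pred_expr, obs_expr, control_expr):
    """Correlate predicted vs. observed expression *changes* relative to a
    batch-matched control, so a model gets no credit for simply
    reproducing the control profile."""
    pred_delta = np.asarray(pred_expr) - np.asarray(control_expr)  # predicted effect
    obs_delta = np.asarray(obs_expr) - np.asarray(control_expr)    # measured effect
    return np.corrcoef(pred_delta, obs_delta)[0, 1]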

Why This Matters: Towards a More Predictive Future for Drug Discovery

The ability to accurately simulate cellular responses to novel genetic perturbations, as demonstrated by TxPert, is a significant step. While TxPert focuses on transcriptomic changes and is an important contribution rather than a complete solution to cellular simulation, this offers a glimpse into a future where:

  • Therapeutic hypotheses are prioritized more strategically, moving beyond exhaustive experimental screening towards more rational, simulation-guided design.
  • Drug discovery cycles are accelerated by focusing wet-lab resources on the most promising interventions identified in silico.
  • Our understanding of complex biological networks is deepened through models that learn from and integrate vast webs of biological knowledge.

TxPert provides a robust framework and establishes a new performance benchmark for future developments in predicting perturbation effects. At Valence Labs, we view this work as a foundational element in our broader mission to construct comprehensive virtual cells—dynamic, multi-modal models capable of predicting, explaining, and helping to discover cellular behavior at an unprecedented scale. This research is integral to the larger effort at Recursion of decoding biology to radically improve lives.

Introduction

The quest to decode biology is one of humanity’s most enduring and critical challenges. Traditional drug discovery, a painstaking process often spanning over a decade and costing billions, still faces a high rate of failure in clinical trials. We believe a fundamental shift is necessary to alter this paradigm—one that moves towards a more comprehensive, predictive, and mechanistic understanding of biology.

What if we could construct computational models that not only predict cellular responses to interventions but also explain why those responses occur? What if these “virtual cells” could help us navigate the immense complexity of biology to discover and develop new medicines more efficiently and effectively?

At Valence Labs, Recursion’s AI research engine, we are exploring this frontier. We believe the time is ripe to engineer the foundations for such virtual cells. In our recently released perspective paper, “Virtual Cells: Predict, Explain, Discover,” we share our thinking on developing and validating these computational systems.

Predict, Explain, Discover: The Pillars of the Virtual Cell

Our view of the virtual cell rests on three interconnected capabilities:

  • Predict: Virtual cells must accurately predict how cells respond to a broad range of perturbations—from genetic edits to novel chemical compounds. The ability to accurately predict changes across diverse biological readouts—gene expression, protein activity, cellular morphology—is key to prioritizing hypotheses and de-risking the early stages of drug discovery.
  • Explain: Moving beyond prediction, virtual cells must offer mechanistic explanations that support reasoning about what would happen under different interventions—not just what did happen. This means enabling counterfactual reasoning and capturing how perturbations ripple through the complex, hierarchical organization of biological systems, from molecular networks to cellular functions to tissue-level behaviours, higher-order systems, and beyond. Such causal insights are essential for building biological understanding and for designing interventions (i.e. therapeutics) with greater precision and confidence.
  • Discover: When these predictive and explanatory capabilities are integrated within a dynamic, learning-driven framework, virtual cells can evolve into powerful engines for biological discovery. We envision systems that iteratively generate hypotheses, design experiments for validation, and continuously refine their internal models, actively guiding the search for therapeutic opportunities.

Foundational Pillars at Recursion Enable the Pursuit of the Virtual Cell

Realizing this vision is possible due to unique capabilities within Recursion:

  • Industrial-Scale Interventional Data: Building and validating virtual cells requires biological data at an unprecedented scale and diversity. Recursion has dedicated over a decade to creating one of the world’s largest and most relatable proprietary biological and chemical datasets, now exceeding 65 petabytes. This dynamic and interventional data spans phenomics, transcriptomics, proteomics, invivomics, patient-level data and more. With up to 2.2 million experiments executed weekly via Recursion’s automated labs, these datasets continue to expand.

  • Pioneering AI and Machine Learning for Biology: Modern AI/ML techniques power virtual cells.  We’re focused on developing and applying ML approaches tailored to the complexity of biology and drug discovery, building on a strong track record that includes foundational research, open-source models, and real-world applications.

  • Purpose-Built Supercomputing Infrastructure: The scale of biological data and complexity of AI models necessitate significant computational capacity. Recursion’s BioHive-2, the most powerful supercomputer in biopharma, provides us with the essential infrastructure to train industry-leading models and explore the frontiers of biological simulation.

  • Synergy of Automation and AI: Tight integration between Recursion’s advanced lab automation and Valence Labs’ AI model development creates powerful feedback loops, enabling rapid cycles of hypothesis generation, experimental validation, and model refinement.

Laying the Groundwork: TxPert as an Early Step in Our Blueprint

With Recursion, we are taking concrete, iterative steps to build upon our foundations on our path to virtual cells. Today, alongside our perspective paper, we are introducing TxPert, a state-of-the-art model for predicting transcriptional responses to combinatorial genetic perturbations. TxPert is designed to perform robustly in out-of-distribution (OOD) scenarios, such as predicting the effects of unseen gene knockdowns, combinations of perturbations, or responses in new cellular contexts.

Accurately predicting these responses is essential for prioritizing hypotheses in drug discovery, where experimental testing of all possibilities is neither scalable nor economical. TxPert achieves strong performance by leveraging biological knowledge networks and graph-based architectures, enabling it to capture context-specific effects and generalize beyond the training distribution.

TxPert tackles the “Predict” capability within our framework. It represents an initial step in our journey towards the virtual cell, and the benchmarking framework it introduces exemplifies our commitment to building models informed by biological complexity and guided by real-world drug discovery outcomes.

The Engine of Discovery: Agentic Systems

The virtual cell is a dynamic, continuously learning system. As detailed in our paper, we envision a “lab-in-the-loop” paradigm where virtual cells are iteratively refined. Conceptually, a virtual cell can be considered a theory of human cellular physiology and pathology that generates hypotheses about biological systems, observable through experiments.

Agents can then use these hypotheses to design and run experiments in the lab, explicitly seeking to falsify the virtual cell’s current “theory”. An experiment that proves a prediction wrong becomes a crucial learning opportunity. This helps refine the model, making it more accurate and robust. By actively seeking experiments that falsify the current virtual cell, we can uncover novel biology and accelerate therapeutic discovery. 

This cycle – predict, test (falsify), refine – creates a powerful discovery engine. In the context of drug discovery, it represents a shift from the slower “design-make-test-measure” paradigm to a faster “design-simulate-test-learn” or “design-simulate” approach, paving the way for new therapeutic opportunities.

Exploring how agentic systems can accelerate this cycle is a key focus. We are actively developing systems capable of not only conditionally generating hypotheses using virtual cells but also prioritizing hypotheses for falsification, designing experiments to efficiently test those hypotheses, orchestrating the execution of experiments, analyzing the experimental outcomes, and integrating these into the virtual cell for iterative refinement. Recursion’s work on the LLM Orchestrated Workflow Engine (LOWE) illustrates this potential.


Bridging Biological Scales: From Molecular Mechanisms to Cellular Function

A virtual cell must also bridge the gap between the molecular world and cellular phenomena. While simulating entire cells from first principles is not yet feasible, we are aiming to integrate across scales. This involves leveraging high-fidelity, bottom-up simulations for key molecular events and connecting these physics-informed models with functional data from Recursion’s large-scale cellular experiments.

Our research includes exploring advanced molecular modeling techniques, such as work on understanding quantum-derived chemical properties (e.g., OpenQDC), to infuse our models with a deeper mechanistic understanding.

A Path Forward: Rigorous Benchmarking and Collaborative Progress

Tangible progress in the development of virtual cells hinges on establishing and adopting robust, biologically grounded benchmarks. Standardized evaluation is essential for ensuring new methods advance predictive and explanatory capabilities.

Initiatives that promote open benchmarking platforms, such as Polaris, and the sharing of high-quality datasets, like RxRx3-core, are vital steps in establishing these high standards and fostering transparent progress across the field.

Our ultimate goal is to contribute to a future where drug discovery is dramatically accelerated and the probability of success is significantly increased for therapies addressing critical patient needs. The path towards virtual cells and eventually virtual tissues, organs, and even patient-specific models is ambitious but charts a course towards a new era of precision medicine.



We are excited to share our perspective and ongoing work, and we invite the broader scientific community to engage with these ideas and to challenge and build upon them as we collectively work towards a new, more predictable, and powerful paradigm in the discovery of new medicines.

We encourage you to read our full perspective paper, “Virtual Cells: Predict, Explain, Discover,” to delve deeper into the vision guiding our research at Valence Labs.

Introduction

We curated and consolidated 40+ quantum mechanics (QM) datasets, covering 1.5 billion geometries across 70 atom species and 250+ QM methods, into a single, accessible hub called OpenQDC. It’s open source, and the datasets are available through the OpenQDC Python library. Install it via pip (pip install openqdc) to start downloading and using various QM datasets in just a single line of code.



Github page: https://github.com/valence-labs/openQDC



Website: https://www.openqdc.io/

Challenges with QM Datasets

Developing robust machine learning interatomic potentials (MLIPs) requires vast amounts of QM data. Unfortunately, there is a lack of standardized, plug-and-play datasets that can be used to train and test new ML algorithms, hindering the prototyping of new research in this field.



Existing QM datasets span various methods and different chemical spaces. They’re also scattered across several repositories (e.g., QCArchive, ColabFit, NablaDFT, GEOM) with missing metadata (e.g., level of theory and units), adding an extra layer of complexity to working with these datasets. This not only hampers the adoption and utility of the data, but also stifles opportunities for collaboration among physicists, chemists, ML experts, and experts in other fields, limiting the progress of ML research.

Introducing OpenQDC

With OpenQDC, we aim to unify and standardize existing, well-known datasets to advance the future of MLIP research. We collected publicly-available datasets and computed essential metadata that was missing but necessary for accurate data processing (e.g. energy, distance, force units, and isolated atom energies).

The QM methods and physical units are rigorously annotated and validated, and are used to provide useful statistics, normalization methods, and unit conversions, offering efficient ways to utilize multiple datasets together in new and previously impossible ways and to further advance the frontier of MLIP research.
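As a small illustration of what this standardization enables, the sketch below loads two datasets with matching units and array formats so that samples from either source can be used in a single training pipeline. It relies only on constructor options shown in the examples later in this post; exact option values may differ between releases.

Python

from openqdc import MACEOFF, SpiceV2

# Request the same energy unit, distance unit, and array format for both
# datasets so that their samples are directly comparable and can be mixed.
ds_a = SpiceV2(energy_unit="kj/mol", distance_unit="ang", array_format="torch")
ds_b = MACEOFF(energy_unit="kj/mol", distance_unit="ang", array_format="torch")

# Number of conformations in each dataset (assuming the usual map-style
# dataset interface, i.e. __len__ and __getitem__).
print(len(ds_a), len(ds_b))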

[Table: per-dataset summary of the OpenQDC collection, listing for each dataset the number of conformations, energy labels (# E), force labels (# F), atom types, and the minimum/maximum number of atoms per system. Potential-energy datasets include ANI-1, ANI-1x, ANI-1ccx, ANI-2x, COMP6, GDML, GEOM, ISO17, MD22, Molecule3D, NablaDFT, OrbNet Denali, PubChemQC PM6 and B3LYP, QM7, QM7-X, QM8, QM9, QMugs, RevMD17, SN2 Reactions, Solvated Protein Fragments, SPICE, SPICE v2, tmQM, Transition1x, WaterClusters, Alchemy, MACE-OFF, QM1B, and others, totalling 1,400,126,279 conformations across systems of 1 to 370 atoms. Interaction-energy datasets include DES370K, DES S66, DES S66x8, Metcalf, X40, L7, and Splinter, totalling 7,018,783 data points.]

The OpenQDC Library

The OpenQDC Python library makes it easy to work with all of the quantum datasets in the hub. It’s a package that aims to provide a simple and efficient way to download, load, and utilize various datasets. You can download datasets with just one line of code.


  • A simple pythonic API: The simplicity of the Python interface ensures ease of use, making it perfect for quick prototyping.
  • ML-Ready: All you manipulate are torch.Tensor, jax.Array, or numpy.ndarray objects.
  • Quantum ready: The quantum methods used by the datasets are checked and standardized to provide additional values, useful normalization, and different statistics.
  • Standardized: The datasets are written in standard and performant formats with annotated metadata like units and labels.
  • Performance matters: Read and write multiple formats (memmap, zarr, xyz, etc.).
  • Data: Have access to 1.5+ billion data points.
  • Open source & extensible: OpenQDC and all its files and datasets are open source, and you can add your own dataset and share it with the community in just a few minutes.

Getting Started

Install OpenQDC with pip or conda:

Unset

pip install openqdc
# or
conda install openqdc -c conda-forge

Now you are ready to use all our QM datasets with the ready-to-use CLI:

Unset

openqdc download SpiceV2

Or using the Python API:

Python

from openqdc import SpiceV2

# Automatically download the data
dataset=SpiceV2()

Below is a glimpse of how easy it is to use OpenQDC and how it interfaces with torch and torch_geometric:

Python

# Load the dataset
from openqdc import MACEOFF
from torch.utils.data import DataLoader

# Request distances in angstrom, energies in kJ/mol, and torch tensors
dataset = MACEOFF(distance_unit="ang", energy_unit="kj/mol", array_format="torch")

# Create the dataloader by simply passing the dataset
dataloader = DataLoader(dataset, batch_size=32)

# Do your own magic
...

Because OpenQDC is framework agnostic, it can also be used with torch_geometric; in this case, we use the radius_graph function from torch_cluster to create a graph:

Python

from openqdc import SpiceV2
from torch_cluster import radius_graph
from torch_geometric.loader import DataLoader
from torch_geometric.data import Data

# We create a function to convert each item into a graph
def to_pyg_data(x):
    # or any other technique to build a graph (or use the smiles from the dataset)
    edge_index = radius_graph(x.positions, r=5.0)
    return Data(edge_index=edge_index, **x)


# Use the transform attribute to automatically convert your items
ds = SpiceV2(array_format="torch", distance_unit="ang", transform=to_pyg_data)

# Create the pyg dataloader by simply passing the new dataset
loader = DataLoader(ds, batch_size=32, shuffle=True)

# Do your own magic
...

We hope OpenQDC can be a great resource for the community to advance MLIP research towards a future of training universal potentials with greater generalizability and robustness.

Please feel free to share your feedback or connect with the Valence Labs team on GitHub, X, LinkedIn, or the Valence Portal!



On the Scalability of GNNs for Molecular Graphs

Scaling deep learning models has been at the heart of recent revolutions in language modeling and image generation. Practitioners have observed a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) have yet to show the benefits of scale, mainly due to the lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures.



We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs. For the first time, we observe that GNNs benefit tremendously from increasing width, number of molecules, number of labels, and diversity of the pretraining datasets.



We’re excited to introduce MolGPS, a 3B parameter model for various molecular property prediction tasks. We hope that this work will pave the way for an era where foundational GNNs drive pharmaceutical drug discovery. Not only does the model’s performance scale with parameters, but it also benefits tremendously from integrating high-level phenomics data into the mix.

Model Details and Performance

MolGPS was trained on the LargeMix dataset mixture, consisting of 5 million molecules grouped into 5 different tasks, each with multiple labels. LargeMix contains datasets like L1000_VCAP and L1000_MCF7 (transcriptomics), PCBA_1328 (bioassays), and PCQM4M_G25 and PCQM4M_N4 (DFT simulations).

We also added a classification dataset using a subset of Recursion’s phenomics data. This dataset was created by using a pre-trained masked autoencoder to cluster the phenomics images into 6,000 different classes, which are then used for binary classification.

MolGPS was first pretrained using a common multi-task supervised learning strategy and was then finetuned (or probed) for various molecular property prediction tasks to evaluate performance. We benchmarked the performance of MolGPS on the Therapeutics Data Commons (TDC), MoleculeNet, and Polaris benchmarks.

Therapeutics Data Commons (TDC) and MoleculeNet

Our study focuses on the 22 ADMET (absorption, distribution, metabolism, excretion, and toxicity) tasks available in TDC. This benchmark has been around for years and receives continuous submissions from various groups; both deep-learning models and traditional machine learning methods occupy the top of the leaderboard, with a total of 8 models sharing the top positions across all 22 tasks. Simply by scaling our model, we found that MolGPS outperforms SOTA on 12 of the 22 tasks.

We investigate 4 datasets from MoleculeNet that are commonly used in similar studies: BACE (which assesses the binding outcomes of a group of inhibitors targeting β-secretase), BBBP (Blood-Brain Barrier Penetration, which assesses whether a molecule can penetrate the central nervous system), ClinTox (which concerns the toxicity of molecular compounds), and SIDER (the Side Effect Resource, which contains information about adverse drug reactions for marketed drugs). We found that MolGPS outperforms SOTA (all self-supervised or quantum-based self-supervised pre-trained models) on all 4 tasks.

While TDC and MoleculeNet are commonly used benchmarks for open-source drug discovery evaluation, we note that they suffer from data collection and processing biases across dissimilar molecules. These limitations have been described previously in conversations throughout the community.



Polaris is a new collection of benchmarks and datasets curated through a standardized evaluation protocol developed by an industry consortium of biotech and pharmaceutical companies. We investigated the performance of MolGPS on 12 ADMET and binding prediction tasks and found that MolGPS outperforms SOTA on 11 of the 12 tasks.

Scaling Experiments

In the following experiments, we look at the performance of MolGPS while increasing the width such that the parameter count goes up from 1M to 3B parameters. To properly assess the benefits of scale, we evaluate the performance of the model when probing the 22 downstream tasks from TDC. Here, the hyper-parameter search is done with 10M parameters, and width scaling is done in a zero-shot setting using the muTransfer technique.


We also observe that GNNs benefit tremendously from increasing the model’s width and that the gains are consistent and linear in the logarithm of the parameter count. Indeed, there appears to be no slowdown in the scaling curve, hinting that we could continue to improve performance with larger and larger models, similar to the trends observed for large language models (LLMs).


Further, we note that MolGPS shows significant improvement over the performance achieved by the TDC baselines, which represent the best model performance per task when TDC was first introduced in 2021. Note that on the y-axis, a value of 0 represents the average of all submissions to the TDC benchmark. Compared to the latest SOTA on TDC, our ensemble model passes the line of the best model per task, meaning that it is generally better to simply use MolGPS than to try all 30+ methods that are part of the TDC benchmark.


We also note that the model reaches a limit when scaling on public data only, but the addition of private phenomics data pushes the boundary of scale and performance much further.
Above, we report the “normalized performance” representing the average of the z-score across the 22 tasks from TDC. The z-score is computed based on the model’s performance relative to the leaderboard for a task, adjusted for the polarity of the task metric, i.e., multiplied by -1 if “lower is better”.
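For readers who want the exact arithmetic, below is a minimal sketch of this normalized-performance computation; the helper and its inputs are hypothetical, not our evaluation script.

Python

import numpy as np


def normalized_performance(model_scores, leaderboard_scores, lower_is_better):
    """model_scores[t]: the model's result on task t;
    leaderboard_scores[t]: list of leaderboard results for task t;
    lower_is_better[t]: True if the task metric is minimized (e.g., MAE)."""
    z_scores = []
    for task, score in model_scores.items():
        board = np.asarray(leaderboard_scores[task], dtype=float)
        z = (score - board.mean()) / board.std()
        if lower_is_better[task]:
            z = -z  # flip polarity so that a higher z-score is always better
        z_scores.append(z)
    # A value of 0 corresponds to the average leaderboard submission
    return float(np.mean(z_scores))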

Molecules vs. language

In the context of LLMs, our scaling experiment results may seem surprising given our models are only trained on a few million data points (molecules) while LLMs are typically trained on larger datasets comprising trillions of tokens. To better understand the performance increase and gap in dataset size, it’s helpful to draw a few analogies between molecules and language.


In our setting, molecules are analogous to sentences in language processing, while the atoms and bonds are analogous to tokens. Moreover, the task is supervised, and some molecules have thousands of associated labels coming from experimental data. This allows the learned molecular embeddings to be much richer than those derived from simply predicting a missing token.



LOWE

LOWE—our LLM-orchestrated Workflow Engine—is an LLM agent that represents the next evolution of the Recursion OS. LOWE supports drug discovery programs by orchestrating complex workflows. These workflows chain together a variety of steps and tools, from finding significant relationships within Recursion’s Maps of Biology and Chemistry to generating novel compounds and scheduling them for synthesis and experimentation. Through its natural language interface and interactive graphics, LOWE puts state-of-the-art AI tools into the hands of every drug discovery scientist at Recursion in a simple and scalable way.

Integrating with the Recursion OS

At the forefront of LOWE’s functionality is its ability to integrate with the Recursion OS. This includes access to petabytes of proprietary data and specialized computational tools tailored for drug discovery.

This includes the ability to navigate and assess relationships within Recursion’s proprietary PhenoMap data, use MatchMaker to identify drug-target interactions, and deploy deep learning-based generative chemistry methods. This integration enables LOWE to perform critical, multi-step tasks in drug discovery such as identifying new therapeutic targets, designing novel compounds and libraries, and predicting ADMET properties. Additionally, LOWE streamlines the process of procuring commercial compounds, enhancing the operational efficiency of R&D projects.

Streamlining drug discovery workflows at Recursion

Traditionally, early-stage drug discovery involves multi-disciplinary collaboration between teams of chemists and biologists over several months or years. Within Recursion, this process typically requires biologists to delineate biological pathways and establish novel map relationships, followed by chemists optimizing chemical series for the selected targets. LOWE streamlines these processes by capturing these disparate functions within a single user interface, operable using natural language commands. We believe this has the potential to dramatically reduce the time and resources required to progress early discovery programs.

A key aspect of LOWE is the accessibility of AI tools it provides for every Recursionaut scientist. By leveraging an LLM agent and natural language interface, LOWE enables all drug discovery scientists at Recursion, regardless of whether they have formal training in machine learning, to access state-of-the-art AI algorithms and computational tools. Additionally, LOWE provides extensive data visualization to help Recursion’s scientists efficiently parse the output of each query.

We invite potential partners interested in harnessing the power of LOWE for therapeutic discovery using the Recursion OS to contact us at partner@recursionpharma.com.

The future of TechBio

The growing number of AI tools being developed and datasets being generated at Recursion increases the complexity of early drug discovery workflows. New systems are required to streamline this complexity in order to realize the full potential of the Recursion OS. This is the true power of LOWE.

LOWE’s ability to streamline complex workflows, integrate with specialized tools, and make cutting-edge AI accessible to all scientists marks a significant advancement in how drug discovery projects can be run. LOWE demonstrates the power of LLM-based workflow engines in enhancing efficiency, fostering innovation, and driving forward the discovery of new and effective medicines. We are excited to see how LOWE will impact the future of drug discovery and the potential breakthroughs it will enable.

Ask LOWE to:

  • Generate novel compounds
  • Compute ADMET properties
  • Filter compounds
  • Give a list of known targets
  • Order compounds
  • Schedule compounds for experiments
  • Use MatchMaker
  • Expand the chemical space
  • List compounds by solubility
  • Generate compounds given a target

