
Introducing MolGPS — A Foundational GNN for Molecular Property Prediction
Read the paper: On the Scalability of GNNs for Molecular Graphs
Scaling deep learning models has been at the heart of recent revolutions in language modeling and image generation. Practitioners have observed a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) have yet to show the benefits of scale, mainly due to the lower efficiency of sparse operations, large data requirements, and a lack of clarity about the effectiveness of various architectures.
We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs. For the first time, we observe that GNNs benefit tremendously from increasing width, number of molecules, number of labels, and diversity in the pretraining datasets.
We’re excited to introduce MolGPS, a 3B-parameter model for a wide range of molecular property prediction tasks. We hope that this work will pave the way for an era where foundational GNNs drive pharmaceutical drug discovery. Not only does the model’s performance scale with parameter count, but it also benefits substantially from integrating high-level phenomics data into the mix.
Scaling Experiments

In the following experiments, we look at the performance of MolGPS as we increase its width such that the parameter count grows from 1M to 3B parameters. To properly assess the benefits of scale, we evaluate the model by probing it on 22 downstream tasks from the Therapeutics Data Commons (TDC). Here, the hyper-parameter search is performed at the 10M-parameter scale, and the resulting hyper-parameters are transferred zero-shot to the wider models using the muTransfer technique.
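As a rough illustration of how this zero-shot transfer works, the sketch below follows the general muTransfer (μP) recipe rather than our actual training code: tune hyper-parameters once at a small proxy width, then reuse them at larger widths by rescaling initialization and per-layer learning rates. The names `base_width` and `width_mult`, the toy MLP, and the specific scaling factors are illustrative simplifications.

```python
# Simplified sketch of muTransfer-style width scaling (not the MolGPS code).
# Hyper-parameters are tuned at a small "base" width and transferred zero-shot
# to wider models by rescaling init and per-layer learning rates (muP rules).
import math
import torch
import torch.nn as nn

def make_model(width: int) -> nn.Sequential:
    """Toy stand-in for a GNN/graph-Transformer block whose hidden width is scaled."""
    return nn.Sequential(
        nn.Linear(64, width),     # input layer: fan-in fixed by the input features
        nn.ReLU(),
        nn.Linear(width, width),  # hidden layer: the part that grows with scale
        nn.ReLU(),
        nn.Linear(width, 1),      # readout layer
    )

def mup_optimizer(model: nn.Sequential, base_lr: float, width_mult: float) -> torch.optim.Adam:
    """Apply simplified muP rules for Adam: wider layers get a 1/width_mult learning rate."""
    input_layer, _, hidden_layer, _, readout = model
    # Hidden weights: variance ~ 1/fan_in; readout: shrink the init so outputs stay O(1).
    nn.init.normal_(hidden_layer.weight, std=1.0 / math.sqrt(hidden_layer.in_features))
    nn.init.normal_(readout.weight, std=1.0 / readout.in_features)
    param_groups = [
        {"params": input_layer.parameters(), "lr": base_lr},                # width-independent
        {"params": hidden_layer.parameters(), "lr": base_lr / width_mult},  # scales as 1/width
        {"params": readout.parameters(), "lr": base_lr / width_mult},
    ]
    return torch.optim.Adam(param_groups)

base_width = 256   # proxy width where the hyper-parameter search was run
base_lr = 3e-4     # learning rate found at that proxy width
for width in (256, 1024, 4096):  # transfer the same base_lr to wider models, zero-shot
    model = make_model(width)
    optimizer = mup_optimizer(model, base_lr, width_mult=width / base_width)
```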
We also observe that GNNs benefit substantially from increasing the model’s width and that the gains are consistent and linear in the logarithm of the parameter count. Indeed, there appears to be no slowdown in the scaling curve, hinting that we could continue to improve performance with larger and larger models, similar to the trend observed with large language models (LLMs).
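To make “linear in the logarithm of the parameter count” concrete: such a trend is typically checked by regressing the normalized score against log10 of the number of parameters. The snippet below is purely illustrative, with placeholder values rather than results from the paper.

```python
# Illustrative check of a log-linear scaling trend (placeholder numbers only).
import numpy as np

params = np.array([1e6, 1e7, 1e8, 1e9, 3e9])       # model sizes (placeholders)
scores = np.array([0.10, 0.35, 0.60, 0.85, 0.97])  # normalized performance (placeholders)

slope, intercept = np.polyfit(np.log10(params), scores, deg=1)
print(f"score ~ {slope:.2f} * log10(params) + {intercept:.2f}")
```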
Further, we note that MolGPS shows a significant improvement over the TDC baselines, i.e., the best model performance per task reported when TDC was first introduced in 2021. Note that on the y-axis, a value of 0 represents the average of all submissions to the TDC benchmark. Compared to the latest SOTA on TDC, our ensemble model passes the line of the best model per task, meaning that it is generally better to simply use MolGPS than to try all 30+ methods that are part of the TDC benchmark.
We also note that the model reaches a limit when scaling on public data only, but the addition of private phenomics data pushes the boundary of scale and performance much further.
Above, we report the “normalized performance”, i.e., the average of the z-scores across the 22 tasks from TDC. Each z-score is computed from the model’s performance relative to the leaderboard submissions for that task, adjusted for the polarity of the task metric, i.e., multiplied by -1 if “lower is better”.
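In code, this metric boils down to the following (a minimal sketch with illustrative names, not our evaluation pipeline):

```python
# Sketch of the "normalized performance": per-task z-score against the TDC
# leaderboard, sign-flipped for "lower is better" metrics, averaged over tasks.
import numpy as np

def task_z_score(model_score, leaderboard_scores, lower_is_better=False):
    mean, std = np.mean(leaderboard_scores), np.std(leaderboard_scores)
    z = (model_score - mean) / std
    return -z if lower_is_better else z

def normalized_performance(results):
    """`results`: one (model_score, leaderboard_scores, lower_is_better) tuple per task."""
    return np.mean([task_z_score(*task) for task in results])
```

Under this scheme, a score of 0 corresponds to the average leaderboard submission, which is why the TDC average sits at 0 on the y-axis above.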
Molecules vs. language
In the context of LLMs, our scaling results may seem surprising given that our models are trained on only a few million data points (molecules), while LLMs are typically trained on far larger datasets comprising trillions of tokens. To better understand the performance increase despite the gap in dataset size, it’s helpful to draw a few analogies between molecules and language.
In our setting, molecules are analogous to sentences in language processing, while the atoms and bonds are analogous to tokens. Moreover, the task is supervised, and some molecules have thousands of associated labels coming from experimental data. This allows the learned molecular embeddings to be much richer than those derived from simply predicting a missing token.
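To make the analogy concrete, the sketch below shows what a single training example roughly looks like in this setting, assuming PyTorch Geometric’s `Data` container (a tooling assumption on our part; the feature sizes and label count are illustrative): a graph of atom “tokens” connected by bond edges, paired with a wide, partially observed vector of experimental labels instead of a single next-token target.

```python
# Illustrative molecular training example: a graph plus thousands of (sparse) labels.
import torch
from torch_geometric.data import Data

num_atoms, num_atom_features, num_labels = 20, 32, 3000

molecule = Data(
    x=torch.randn(num_atoms, num_atom_features),      # per-atom "token" features
    edge_index=torch.randint(0, num_atoms, (2, 40)),   # bonds as pairs of atom indices
    y=torch.full((1, num_labels), float("nan")),       # assay labels, mostly missing
)
molecule.y[0, :100] = torch.randn(100)                 # e.g. a molecule with 100 measured labels
```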
