Introducing OpenQDC – The Open-Source Hub of ML-Ready Quantum Datasets

Introduction

We curated and consolidated 40+ quantum mechanics (QM) datasets, covering 1.5 billion geometries across 70 atom species and 250+ QM methods, into a single, accessible hub called OpenQDC. The hub is open source, and the datasets are available through the OpenQDC Python library. Install it via pip (pip install openqdc) to start downloading and using QM datasets with a single line of code.

GitHub page: https://github.com/valence-labs/openQDC

Website: https://www.openqdc.io/

Challenges with QM Datasets

Developing robust machine learning interatomic potentials (MLIPs) requires vast amounts of QM data. Unfortunately, there is a lack of standardized, plug-and-play datasets for training and testing new ML algorithms, which slows the prototyping of new research in this field.

Existing QM datasets span various methods and different chemical spaces. They are also scattered across several repositories (e.g. QCArchive, ColabFit, NablaDFT, GEOM) and often lack metadata (e.g. level of theory and units), which adds an extra layer of complexity to working with these datasets. This not only hampers the adoption and utility of the data, but also stifles opportunities for collaboration among physicists, chemists, ML experts, and experts in other fields, limiting the progress of ML research.

Introducing OpenQDC

With OpenQDC, we aim to unify and standardize existing, well-known datasets to advance the future of MLIP research. We collected publicly available datasets and computed essential metadata that was missing but necessary for accurate data processing (e.g. energy, distance, and force units, as well as isolated atom energies).

The QM methods and physical units are rigorously annotated and validated. They are then used to provide statistics, normalization methods, and unit conversions, which makes it practical to combine multiple datasets in ways that were previously impossible and to further advance the frontier of MLIP research.
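To make the role of units and isolated atom energies concrete, here is a minimal sketch of the kind of bookkeeping this metadata enables. It is not OpenQDC's implementation: the helper names (convert, formation_energy) are hypothetical, the conversion factors are rounded CODATA 2018 values, and the energies are placeholder numbers chosen only for illustration.

```python
# Conversion factors out of Hartree and Bohr (CODATA 2018, rounded).
HARTREE_TO = {
    "hartree": 1.0,
    "ev": 27.211386,
    "kcal/mol": 627.509474,
    "kj/mol": 2625.499639,
}
BOHR_TO = {"bohr": 1.0, "ang": 0.529177, "nm": 0.0529177}


def convert(value, src, dst, table):
    """Convert `value` from unit `src` to unit `dst` using one factor table."""
    return value / table[src] * table[dst]


def formation_energy(total_energy, atomic_numbers, isolated_atom_energies):
    """Subtract the summed isolated-atom energies from a total energy.

    All energies must already be expressed in the same unit.
    """
    reference = sum(isolated_atom_energies[z] for z in atomic_numbers)
    return total_energy - reference


# Placeholder numbers for illustration only; they are not taken from any dataset.
toy_atom_energies = {1: -0.5, 8: -75.0}  # Hartree, keyed by atomic number
toy_total_energy = -76.4                 # Hartree, a water-like molecule

e_form = formation_energy(toy_total_energy, [8, 1, 1], toy_atom_energies)
e_form_kcal = convert(e_form, "hartree", "kcal/mol", HARTREE_TO)
```

Without the unit and the per-atom reference energies recorded for each level of theory, neither step can be done reliably, which is why two datasets with missing metadata are hard to combine.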

Dataset	# conf.	# E	# F	# Atom type	Atom Min/Max	Dataset	# conf.	# E	# F	# Atom type	Atom Min/Max
ANI-1	22,057,374	1	0	4	2/26	ANI-1x	4,956,005	8	2	4	2/63
ANI-1ccx	489,571	4	0	4	2/63	ANI-2x	9,651,712	1	1	4	22/63
COMP6	101,352	1	0	4	6/312	GDML	3,875,468	3	3	4	9/24
GEOM	33,078,483	1	1	6	3/181	ISO17	640,982	1	0	5	1/19
MD22	223,442	1	1	4	42/370	Molecule3D	3,899,647	1	0	8	1/137
MultixQM9	133,631	229	0	5	3/29	NablaDFT	1,275,340	1	1	8	8/57
OrbNet D.	2,338,889	2	0	17	2/74	Pub. PM6	189,890,155	1	0	70	1/215
Pub. B3lyp	85,915,773	1	0	70	1/215	QM7	7,165	1	0	5	2/23
QM7-X	4,195,192	2	1	6	4/23	QM8	21,786	2	0	5	3/8
QM9	133,885	1	0	5	3/9	Qmugs	1,992,984	2	0	10	4/228
RevMD17	999,988	1	1	4	9/24	SN2 React.	452,709	1	0	6	2/6
Sol. Prot.	2,731,180	1	1	5	2/120	Spice	1,110,165	1	1	15	2/110
SpiceV2	2,008,628	1	1	17	2/110	tmQM	86,665	1	0	44	5/569
Transition1x	9,654,813	1	1	4	4/23	WaterClusters	4,464,740	1	2	2	9/90
Alchemy	202,579	1	0	4	11/38	ANICCXv2	489,457	6	0	4	2/55
BPA	13,993	1	1	4	27/27	MACEOFF	1,001,200	1	1	10	3/150
QM7xv2	4,195,192	3	1	6	4/23	QMugsv2	1,992,941	3	0	10	4/228
QM7b	7,211	76	0	6	4/60	SpiceLv2	2,004,893	3	1	17	2/110
SCANWater	322	19	0	8	1/23	VQMd24	1,104,982	1	0	5	1/21
PtrFrags	2,731,986	1	0	5	2/120	MDDataset	11,819	1	0	5	162/321
QM1B	1,000,000,000	1	0	5	9/11
Potential Total	1,400,126,279	395	16	70	1/370
DES370K	370,959	14	0	2	2/44	DESSM	4,955,938	17	0	14	2/34
DESS86	66	17	0	4	6/34	DESS86x8	528	17	0	4	6/34
Metcalf	13,415	5	0	4	12/41	X40	40	5	0	9	7/25
L7	7	8	0	4	48/112	Splinter	1,677,830	20	0	10	2/51
Interaction Total	7,018,783	0	20	2	2/112

The OpenQDC Library

The OpenQDC Python library makes it easy to work with all of the quantum datasets in the hub. It’s a package that aims to provide a simple and efficient way to download, load, and utilize various datasets. You can download datasets with just one line of code. 

A simple pythonic API: The simplicity of the Python interface ensures ease of use, making it perfect for quick prototyping.
ML-Ready: All you manipulate are torch.Tensor, jax.Array or numpy.ndarray objects.
Quantum ready: The quantum methods used by the datasets are checked and standardized to provide additional values, useful normalization, and different statistics.
Standardized: The datasets are written in standard and performant formats with annotated metadata like units and labels.
Performance matters: Read and write multiple formats (memmap, zarr, xyz, etc.).
Data: Have access to 1.5+ billion data points.
Open source & extensible: OpenQDC and all its files and datasets are open source, and you can add your own dataset and share it with the community in just a few minutes.
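Of the formats listed above, XYZ is the simplest to inspect by hand. The sketch below shows the general shape of the plain XYZ format (atom count, a comment line, then one "symbol x y z" line per atom). It is a generic illustration and not OpenQDC's own reader or writer; the function names and the geometry are placeholders.

```python
def to_xyz(symbols, positions, comment=""):
    """Serialize one conformer to XYZ text (coordinates in angstrom)."""
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, positions):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


def from_xyz(text):
    """Parse a single-conformer XYZ string back into symbols and positions."""
    lines = text.splitlines()
    n_atoms = int(lines[0])
    comment = lines[1]
    symbols, positions = [], []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()
        symbols.append(symbol)
        positions.append((float(x), float(y), float(z)))
    return symbols, positions, comment


# Illustrative water-like geometry; the coordinates are placeholders.
symbols = ["O", "H", "H"]
positions = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

text = to_xyz(symbols, positions, comment="toy water")
parsed_symbols, parsed_positions, parsed_comment = from_xyz(text)
```

Plain XYZ carries no units or level of theory, which illustrates why the annotated metadata mentioned above matters when data is exchanged in text formats.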

Getting Started

Install OpenQDC with pip or conda:

Shell

pip install openqdc
or
conda install openqdc -c conda-forge

You can now download any of our QM datasets with the command-line interface (CLI):

Shell

openqdc download SpiceV2

Or using the Python API:

Python

from openqdc import SpiceV2

# Automatically download the data
dataset=SpiceV2()

Below is a glimpse of how easy it is to use OpenQDC and how it interfaces with torch and torch_geometric:

Python

from openqdc import MACEOFF
from torch.utils.data import DataLoader

# Load the dataset with distances in angstrom and energies in kJ/mol
dataset = MACEOFF(distance_unit="ang", energy_unit="kj/mol", array_format="torch")

# Create the dataloader by simply passing the dataset
dataloader = DataLoader(dataset, batch_size=32)

# Do your own magic
...

Because OpenQDC is framework agnostic, it can also be used easily with torch_geometric. In this case, we use the radius_graph function from torch_cluster to build a graph:

Python

from openqdc import SpiceV2
from torch_cluster import radius_graph
from torch_geometric.loader import DataLoader
from torch_geometric.data import Data

# Convert each dataset item into a torch_geometric graph
def to_pyg_data(x):
    # Or use any other technique to build a graph (or use the SMILES from the dataset)
    edge_index = radius_graph(x.positions, 5)
    return Data(edge_index=edge_index, **x)

# Use the transform argument to automatically convert your items
ds = SpiceV2(array_format="torch", distance_unit="ang", transform=to_pyg_data)

# Create the pyg dataloader by simply passing the new dataset
loader = DataLoader(ds, batch_size=32, shuffle=True)

# Do your own magic
...
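For readers unfamiliar with radius graphs, the following is a brute-force sketch of the idea behind the radius_graph call used above: connect every pair of atoms that lie closer than a cutoff. It is a conceptual illustration in plain Python, not torch_cluster's implementation, which is vectorized and has extra options (for example a cap on the number of neighbors). The function name and the coordinates are hypothetical.

```python
import math


def radius_graph_bruteforce(positions, cutoff):
    """Return directed edges (sources, targets) for atom pairs closer than `cutoff`.

    Self-loops are excluded, and each neighboring pair appears in both directions.
    """
    sources, targets = [], []
    for i, p_i in enumerate(positions):
        for j, p_j in enumerate(positions):
            if i != j and math.dist(p_i, p_j) < cutoff:
                sources.append(i)
                targets.append(j)
    return sources, targets


# Three toy atoms on a line, with coordinates in angstrom.
toy_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

# A short cutoff only links the first two atoms.
short_edges = radius_graph_bruteforce(toy_positions, cutoff=1.5)

# A 5 angstrom cutoff, as in the example above, links every pair.
full_edges = radius_graph_bruteforce(toy_positions, cutoff=5.0)
```

The two returned lists play the role of the two rows of edge_index. The cutoff is a modeling choice: a larger radius yields denser graphs and more expensive message passing.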

We hope OpenQDC can be a great resource for the community to advance MLIP research towards a future of training universal potentials with greater generalizability and robustness.

Please feel free to share your feedback or connect with the Valence Labs team on GitHub, X, LinkedIn, or Valence Portal!
