
Introducing OpenQDC – The Open-Source Hub of ML-Ready Quantum Datasets
Introduction
We curated and consolidated 40+ quantum mechanics (QM) datasets, covering 1.5 billion geometries across 70 atom species and 250+ QM methods, into a single, accessible hub called OpenQDC. It is open source, and the datasets are available through the OpenQDC Python library. Install it via pip (pip install openqdc) to start downloading and using QM datasets in a single line of code.
GitHub page: https://github.com/valence-labs/openQDC
Website: https://www.openqdc.io/

Challenges with QM Datasets
Developing robust machine learning interatomic potentials (MLIPs) requires vast amounts of QM data. Unfortunately, there is a lack of standardized, plug-and-play datasets that can be used to train and test new ML algorithms, which hinders the prototyping of new research in this field.
Existing QM datasets span various methods and different chemical spaces. They are also scattered across several repositories (e.g. QCArchive, ColabFit, NablaDFT, GEOM) with missing metadata (e.g. level of theory and units), adding an extra layer of complexity to working with these datasets. This not only hampers the adoption and utility of the data, but also stifles opportunities for collaboration among physicists, chemists, ML experts, and experts in other fields, limiting the progress of ML research.
The OpenQDC Library
The OpenQDC Python library makes it easy to work with all of the quantum datasets in the hub. It aims to provide a simple and efficient way to download, load, and use them, and a dataset can be downloaded with a single line of code.
- A simple pythonic API: The simplicity of the Python interface ensures ease of use, making it perfect for quick prototyping.
- ML-Ready: All you manipulate are torch.Tensor, jax.Array or numpy.ndarray objects.
- Quantum ready: The quantum methods used by the datasets are checked and standardized to provide additional values, useful normalization, and different statistics.
- Standardized: The datasets are written in standard and performant formats with annotated metadata like units and labels.
- Performance matters: Read and write multiple formats (memmap, zarr, xyz, etc.).
- Data: Have access to 1.5+ billion data points.
- Open source & extensible: OpenQDC and all its files and datasets are open source, and you can add your own dataset and share it with the community in just a few minutes.
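To illustrate why annotated units matter, here is a minimal sketch of the kind of bookkeeping that unit metadata makes possible. It is not OpenQDC's API: the function names and the unit strings other than "ang" are hypothetical, chosen only for this example, and the conversion factors are the CODATA 2018 values.

```python
# Conversion factors (CODATA 2018 values).
HARTREE_TO_EV = 27.211386245988
HARTREE_TO_KCAL_PER_MOL = 627.5094740631
BOHR_TO_ANGSTROM = 0.529177210903

# Factor that converts one unit of each energy unit INTO hartree.
_TO_HARTREE = {
    "hartree": 1.0,
    "ev": 1.0 / HARTREE_TO_EV,
    "kcal/mol": 1.0 / HARTREE_TO_KCAL_PER_MOL,
}

# Factor that converts one unit of each distance unit INTO bohr.
_TO_BOHR = {
    "bohr": 1.0,
    "ang": 1.0 / BOHR_TO_ANGSTROM,
}


def convert_energy(value, src, dst):
    """Convert an energy from unit `src` to unit `dst`, going through hartree."""
    return value * _TO_HARTREE[src] / _TO_HARTREE[dst]


def convert_distance(value, src, dst):
    """Convert a distance from unit `src` to unit `dst`, going through bohr."""
    return value * _TO_BOHR[src] / _TO_BOHR[dst]


# The same energy looks very different depending on the unit it is stored in.
energy_hartree = -76.4
print(convert_energy(energy_hartree, "hartree", "ev"))
print(convert_energy(energy_hartree, "hartree", "kcal/mol"))
```

Without a unit label, two datasets that store the same quantity in hartree and in kcal/mol differ by a factor of roughly 600, which is why mixing sources with missing metadata is error-prone.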
Getting Started
Install OpenQDC with pip or conda:
pip install openqdc
or
conda install openqdc -c conda-forge
Now you are ready to use all our QM datasets, either through the CLI:
openqdc download SpiceV2
Or using the Python API:
from openqdc import SpiceV2
dataset = SpiceV2()
Below is a glimpse of how easy it is to use OpenQDC and how it interfaces with torch and torch_geometric:
dataset=MACEOFF(energy_unit=
dataloader=DataLoader(dataset, batch_size=32)
. . .
Because OpenQDC is framework-agnostic, it can easily be used with torch_geometric. In this case, we can use the radius_graph function from torch_cluster to create a graph:
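from torch_cluster import radius_graph
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

def to_pyg_data(x):
    # Connect atoms that lie within a cutoff radius of 5 (angstrom here) of each other
    edge_index = radius_graph(x.positions, 5)
    return Data(edge_index=edge_index, **x)

ds = SpiceV2(array_format="torch", distance_unit="ang", transform=to_pyg_data)
loader = DataLoader(ds, batch_size=32, shuffle=True)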
. . .
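For readers unfamiliar with radius graphs, the sketch below shows roughly what such a graph contains, using plain NumPy. It is an illustrative stand-in and not torch_cluster's implementation: the name radius_graph_numpy is hypothetical, it computes all pairwise distances (fine for a single small molecule, wasteful for large systems), and it ignores options such as batching.

```python
import numpy as np


def radius_graph_numpy(positions, cutoff):
    """Return a (2, num_edges) array of directed edges between atoms closer than `cutoff`."""
    positions = np.asarray(positions, dtype=float)
    # Pairwise displacement vectors, shape (n, n, 3), reduced to distances, shape (n, n).
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep pairs within the cutoff and drop self-loops on the diagonal.
    mask = (dist < cutoff) & ~np.eye(len(positions), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst])


# Three atoms on a line: atoms 0 and 1 are 1.0 apart, atom 2 is far from both.
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
edge_index = radius_graph_numpy(positions, cutoff=5.0)
print(edge_index)
```

Each column of edge_index is one directed edge, so a pair of neighboring atoms appears twice, once in each direction. In this example only atoms 0 and 1 are connected.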
We hope OpenQDC can be a great resource for the community to advance MLIP research towards a future of training universal potentials with greater generalizability and robustness.
Please feel free to share your feedback or connect with the Valence Labs team on GitHub, X, LinkedIn, or Valence Portal!
