index cache bloom hash tree LSM scan layer merge sort probe partition calc AI SYSTEMS CALCULUS
Harvard Computer Science

The End of Hand-Crafted Systems

My research pursues a fundamental shift: self-designing data and AI systems. By discovering the alphabets and grammars that govern system architectures, we enable machines—not humans—to invent entirely new systems.

The north star: helping bring the full power of data and AI to every scientist, engineer, and decision-maker.

Gordon McKay Professor Faculty Co-Director, Harvard Data Science Initiative Director, Harvard Data and AI Systems Lab Founder, Leibniz Labs (stealth)
SCROLL

A Design Space Beyond Human Reach

The AI revolution is transforming every field and industry, driving unprecedented demand for data-centric computation. As new AI and data applications, data types, hardware platforms, and workloads appear faster than ever, the backbone systems that power this revolution have become the bottleneck.

Training a new model, solving an end-to-end enterprise use case or scientific inquiry depends on how fast we can build tailored systems and execute AI experiments and workflows.

Yet a single system architecture faces a design space larger than 10100 alternatives. Today, we still cling to a handful of "good" templates, each requiring years of manual design and tuning—and each suited to only a narrow set of scenarios. It is time to abandon this artisanal practice.

Alphabets, Grammars, Calculators

The insight: model the design space of systems as an alphabet of low-level design primitives, and whole architectures as sentences in a grammar over that alphabet. Systems calculators can then synthesize fresh blueprints on demand that never existed before.

01 — PRIMITIVES

The Alphabet

Decompose systems into their fundamental design atoms—the smallest decisions that shape how data is laid out, accessed, and processed.

02 — COMPOSITION

The Grammar

Define rules for how primitives combine into coherent architectures, from model training pipelines to full data and inference engines.

03 — SYNTHESIS

The Calculator

Build math- and ML-driven algorithms that navigate this design space, finding optimal designs for specific workloads, hardware, and constraints.

VLDB KEYNOTE, London 2025

Alphabets, Grammars, Calculators, and the End of Hand-Crafted Systems

📄
View Slides

From Theory to Systems

RESULT 01
The Data Calculator
Data structures
1048 designs explored
Seconds to invent and synthesize

Data structures are one of the most fundamental elements of computer science. Roughly 5000 data structures have been invented/published since the dawn of computer science. By capturing the first principles of data layout—how nodes organize data and relate to each other—the Data Calculator invents, codes and explores trillions of previously unknown data structure variants. The periodic table of data structures gives a high level summary of this work.

RESULT 02
Cosine & Limousine
NoSQL
10100 candidate designs
1000× faster

NoSQL systems power AI, Blockchain, Analytics and so much more. Cosine and Limousine generate automatically novel NoSQL stores running up to three orders of magnitude faster than today's best deployments. Designs spans from LSM-trees to B-trees to hash tables—and trillions of hybrids that exist nowhere in literature or industry.

RESULT 03
The Image Calculator
Image AI
Storage/neural network co-design
10× faster

Today Image AI is primarily driven by JPEG storage, but JPEG was created for human eyes decades ago. Images today are seen by algorithms... The Image Calculator synthesizes a massive design space for image storage and invents new designs tailored for neural networks. By optimizing storage and neural network design together, we can achieve order-of-magnitude speedups in end-to-end vision pipelines.

RESULT 04
LegoAI & TorchTitan
Training large models
Max hardware utilization
No failures

Training very large models is slow, expensive and error-prone governed by extremely complex and ever evolving algorithms. LegoAI creates a massive design space and can invent and implement entirely new training algorithms instantly given a neural network, a performance goal and hardware constraints. The novel distributed-training algorithms extract every flop and byte from modern accelerators, automatically adapting to hardware topology and model architecture. LegoAI has been productized into TorchTitan and ships with PyTorch.

Why Calculators and the Dream of Leibniz

The dream of automated reasoning is as old as science itself. In 1666, Gottfried Wilhelm Leibniz completed his doctoral dissertation with a radical vision: a calculus ratiocinator, a universal machine that could derive all truths from a small alphabet of primitive concepts and their combinations. He imagined that disputes could be settled not by argument but by calculation.

Leibniz De Arte Combinatoria 1666
1666

Leibniz's De Arte Combinatoria

The original diagram from Leibniz's dissertation depicting a combinatorial wheel of primitive concepts that generate knowledge through systematic combination.

The Data Calculator Design Primitives
2018

The Data Calculator

Our realization of Leibniz's vision for data systems: design primitives that combine to synthesize optimal data structures for any workload. We now apply this across the full AI stack.

"Let us calculate."
— Gottfried Wilhelm Leibniz

Influences

My work has been inspired by numerous computer science pioneers. Beyond my mentors Martin Kersten and Anastassia Ailamaki, the self-designing systems vision draws from foundational ideas across decades of research: S. Bing Yao's early work on system modeling and storage advisors, Joe Hellerstein's extensible indexing, Don Batory's principles of modular system synthesis, and Stefan Manegold's model synthesis for hardware-conscious algorithms. Mike Franklin's work on discovering new algorithms through taxonomies of design principles follows the same philosophy we now apply to entire system architectures.

Machines Write the Sentences.
Humans Ask Deeper Questions.

These results signal a future in which systems research increasingly focuses on crafting richer alphabets and grammars while machines write the sentences—freeing designers and researchers to pursue more profound questions.

Practitioners will dial in cost, latency, and accuracy with surgical precision.

Vision Paper
The Automatic Scientist
CIDR 2017
Keynote
Self-Designing AI
McKinsey Data & AI Summit 2025

Expanding the Grammar of Intelligence

Building on the foundations of self-designing systems, we are now extending these principles to the full stack of modern AI infrastructure.

01 — RETRIEVAL

RAG Agents

Applying self-designing principles to retrieval-augmented generation, enabling systems that automatically synthesize optimal retrieval strategies, index structures, and agent orchestration patterns.

02 — CONTEXT

Managing Context

Developing grammars for context management that allow systems to self-design how they store, compress, retrieve, and reason over long-range dependencies so we can apply AI to ever more complex problems.

03 — COMPILATION

Large Model Compilers

Creating compilers that transform model specifications into optimized execution plans, automatically navigating the vast design space of hardware mappings, parallelization strategies, and memory hierarchies.

04 — ADAPTATION

Model Fine-Tuning

Extending the calculator paradigm to model adaptation, synthesizing optimal fine-tuning recipes by reasoning over the design space of data selection, parameter-efficient methods, and training dynamics.

Origins: Database Cracking

Systems That Learn from Their Workload

The ideas behind self-designing systems trace back to my PhD work on Database Cracking with my amazing advisors Martin Kersten and Stefan Manegold—a paradigm where data systems continuously adapt their physical storage layout in response to the queries they receive.

"Every query is advice on how data should be stored."

Rather than requiring administrators to manually create indexes upfront, cracking systems treat each query as an opportunity to incrementally reorganize data. Over time, the storage layout converges to one that is perfectly tailored to the actual workload—adapting to data properties, query patterns, and hardware characteristics.

Self-designing systems take this philosophy to its logical extreme: if a system can learn to optimize its storage layout, why not learn to optimize its entire architecture?

Research Group

Alumni & Placement

PhD

  • Brian Hentschel Google Research
  • Sanket Purandare Meta, PyTorch
  • Lukas Maas Microsoft Research
  • Abdul Wasay Intel Research
  • Michael Kester Vertica
  • Wilson Qin 2040 Labs

Postdoctoral Researchers

  • Manos Athanassoulis Professor, Boston University
  • Niv Dayan Professor, University of Toronto
  • Siqiang Luo Professor, NTU Singapore
  • Konstantinos Zoumpatianos Snowflake
  • Hao Jiang Databricks
  • Subarna Chatterjee DataStax

Current Researchers

PhD

  • Milad Rezaei Hajidehi
  • Konstantinos Kopsinis

Postdoctoral Researchers

  • Qitong Wang
  • Utku Sirin

Research Sponsors

Stratos Idreos

Stratos Idreos

Gordon McKay Professor of Computer Science · Harvard University

I am a Professor at Harvard's John A. Paulson School of Engineering and Applied Sciences and Faculty Co-Director of the Harvard Data Science Initiative. I lead DASlab, the Harvard Data and AI Systems Laboratory, where my research pursues a "grammar of systems"—enabling machines to design and tune systems architectures that are tailored to their context, faster, and more scalable.

Before Harvard, I was a researcher at CWI Amsterdam and earned my PhD from the University of Amsterdam. I have co-chaired ACM SIGMOD 2021 and IEEE ICDE 2022, co-founded the ACM/IMS Journal of Data Science, and served as chair of the ACM SoCC Steering Committee.

Speaker Bio

CIDR Test-of-Time Award 2026
Sloan Research Fellowship 2023
Harvard McDonald Mentoring Award 2023
ACM SIGMOD Test-of-Time Award 2022
ACM SIGMOD Contributions Award 2020
DOE Early Career Award 2019
IEEE TCDE Rising Star Award 2015
NSF CAREER Award 2015
ACM SIGMOD Jim Gray Dissertation Award 2011
ERCIM Cor Baayen Award 2011