AI SYSTEMS CALCULUS
Harvard Computer Science

The End of Hand-Crafted Systems

My research pursues a fundamental shift: self-designing data and AI systems. By discovering the alphabets and grammars that govern system architectures, we enable machines—not humans—to write the sentences.

Gordon McKay Professor · Faculty Co-Director, Harvard Data Science Initiative · Director, Harvard Data and AI Systems Lab · Founder, Leibniz Labs (stealth)

A Design Space Beyond Human Reach

The AI revolution is transforming every field and industry, driving unprecedented demand for data-centric computation. As new data types, hardware platforms, and workloads appear faster than ever, the backbone systems that power this revolution must evolve just as quickly.

Yet a single system architecture faces a design space larger than 10^100 alternatives. We still cling to a handful of "good" templates, each requiring years of manual design and implementation tuning. It is time to abandon this artisanal practice.

Alphabets, Grammars, Calculators

The breakthrough: model the design space of systems as an alphabet of low-level design primitives, and whole architectures as sentences in a grammar over that alphabet. Systems calculators can then synthesize fresh blueprints on demand.

01 — PRIMITIVES

The Alphabet

Decompose systems into their fundamental design atoms—the smallest decisions that shape how data is laid out, accessed, and processed.

02 — COMPOSITION

The Grammar

Define rules for how primitives combine into coherent architectures, enabling systematic exploration of the entire design space.

03 — SYNTHESIS

The Calculator

Build engines that navigate this space intelligently, finding optimal designs tailored to specific workloads, hardware, and constraints.
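To make the three steps concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the three primitives, the validity rules, the dataset size `N`, and the cost numbers are invented, and an exhaustive search stands in for the far more sophisticated navigation a real calculator would need. It only shows the shape of the idea: an alphabet of decisions, a grammar that says which combinations form a design, and a calculator that picks the cheapest design for a workload.

```python
import math
from itertools import product

N = 1_000_000  # hypothetical number of keys in the dataset

# Alphabet (hypothetical): each primitive is one small design decision.
ALPHABET = {
    "layout": ["sorted", "hashed", "log"],
    "fanout": [None, 2, 16, 256],   # only meaningful for sorted layouts
    "filter": [False, True],        # e.g. a Bloom filter over log runs
}

def is_valid(d):
    """Grammar (toy rules): which combinations of primitives form a design."""
    fanout_ok = (d["layout"] == "sorted") == (d["fanout"] is not None)
    filter_ok = d["layout"] == "log" or not d["filter"]
    return fanout_ok and filter_ok

def valid_designs():
    """Enumerate every sentence the grammar accepts."""
    names = list(ALPHABET)
    for combo in product(*ALPHABET.values()):
        d = dict(zip(names, combo))
        if is_valid(d):
            yield d

def op_costs(d):
    """Made-up (point read, range read, write) costs for one design."""
    if d["layout"] == "sorted":
        depth = math.log(N, d["fanout"])
        return depth, depth, depth
    if d["layout"] == "hashed":
        return 1.0, float(N), 1.0   # range reads degrade to a full scan
    runs = 10.0                     # log layout: reads may probe every run
    point = 2.0 if d["filter"] else runs
    return point, 2 * runs, 0.1

def cost(d, workload):
    """Weight each operation's cost by its share of the workload."""
    point, rng, write = op_costs(d)
    return (workload["point"] * point
            + workload["range"] * rng
            + workload["write"] * write)

def synthesize(workload):
    """Calculator: return the valid design with the lowest modeled cost."""
    return min(valid_designs(), key=lambda d: cost(d, workload))

if __name__ == "__main__":
    print(synthesize({"point": 0.1, "range": 0.0, "write": 0.9}))
```

Under these invented costs, a write-heavy workload selects the log layout with a filter, while a range-heavy workload selects the sorted layout with the largest fanout. Exhaustive enumeration works here only because the toy space has six valid designs; it would not scale to the design spaces described on this page.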

VLDB 2025 KEYNOTE

Alphabets, Grammars, Calculators, and the End of Hand-Crafted Systems

A comprehensive presentation of this research vision, demonstrating how self-designing systems are transforming data infrastructure, machine learning, and AI at scale.

From Theory to Systems That Work

RESULT 01
The Data Calculator
10^18 designs explored
Seconds to synthesize

The first engine for interactive data structure design. By capturing the first principles of data layout—how nodes organize data and relate to each other—the Data Calculator explores trillions of previously unknown data structure variants to find optimal layouts without implementation or even hardware access.

RESULT 02
Cosine & Limousine
1000× faster
10^36 candidate designs

Self-designing key-value storage engines that generate novel NoSQL stores running up to three orders of magnitude faster than today's best deployments. Cosine spans designs from LSM-trees to B-trees to hash tables—and trillions of hybrids that exist nowhere in literature or industry.

RESULT 03
The Image Calculator
10× faster vision pipelines
Joint optimization

Extends the self-designing paradigm to vision systems, co-designing entirely new storage formats alongside neural network architectures. By optimizing both together, we achieve order-of-magnitude speedups in end-to-end vision pipelines.

RESULT 04
LegoAI & TorchTitan
Novel training algorithms
Maximum hardware utilization

Applies self-designing principles to distributed training of large AI models. These systems invent novel distributed-training algorithms that extract every flop and byte from modern accelerators, automatically adapting to hardware topology and model architecture. TorchTitan ships with PyTorch.

Machines Write the Sentences.
Humans Ask Deeper Questions.

These results signal a future in which systems research increasingly focuses on crafting richer alphabets and grammars while machines write the sentences—freeing designers and researchers to pursue more profound questions.

Practitioners will dial in cost, latency, and accuracy with surgical precision. The era of hand-crafted systems is ending. The era of self-designing systems has begun.

Vision Paper
The Automatic Scientist
CIDR 2017

Keynote
Self-Designing AI
McKinsey Data & AI Summit 2025

Expanding the Grammar of Intelligence

Building on the foundations of self-designing systems, we are now extending these principles to the full stack of modern AI infrastructure.

01 — RETRIEVAL

RAG Agents

Applying self-designing principles to retrieval-augmented generation, enabling systems that automatically synthesize optimal retrieval strategies, index structures, and agent orchestration patterns tailored to specific knowledge domains.

02 — CONTEXT

Managing Context

Developing grammars for context management that allow systems to self-design how they store, compress, retrieve, and reason over long-range dependencies—optimizing the fundamental bottleneck of modern AI systems.

03 — COMPILATION

Large Model Compilers

Creating compilers that transform model specifications into optimized execution plans, automatically navigating the vast design space of hardware mappings, parallelization strategies, and memory hierarchies.

04 — ADAPTATION

Model Fine-Tuning

Extending the calculator paradigm to model adaptation, synthesizing optimal fine-tuning recipes by reasoning over the design space of data selection, parameter-efficient methods, and training dynamics.

Where It All Started: Database Cracking

Systems That Learn from Their Workload

The ideas behind self-designing systems trace back to my PhD work on Database Cracking with my amazing advisors Martin Kersten and Stefan Manegold—a paradigm where data systems continuously adapt their physical storage layout in response to the queries they receive.

"Every query is advice on how data should be stored."

Rather than requiring administrators to manually create indexes upfront, cracking systems treat each query as an opportunity to incrementally reorganize data. Over time, the storage layout converges to one that is perfectly tailored to the actual workload—adapting to data properties, query patterns, and hardware characteristics.
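The incremental reorganization described above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation from the original work: the class name `CrackedColumn`, its methods, and the use of two parallel sorted lists as the index over pieces are assumptions made for this example, and the partitioning step copies each piece instead of swapping values in place.

```python
import bisect

class CrackedColumn:
    """Toy column that reorganizes itself as a side effect of range queries."""

    def __init__(self, values):
        self.data = list(values)  # physical order, refined by every query
        self.pivots = []          # sorted pivot values seen so far
        self.positions = []       # positions[i]: first index with value >= pivots[i]

    def _crack(self, pivot):
        """Split the piece containing `pivot` into (< pivot) and (>= pivot)."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # this boundary already exists
        # The piece to split lies between the two neighboring boundaries.
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.pivots) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high (in no particular order)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]

if __name__ == "__main__":
    col = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
    print(col.range_query(5, 13))  # first query partitions the whole column
    print(col.range_query(8, 10))  # later queries only touch one small piece
    print(col.data, col.pivots)
```

Each query pays only for partitioning the pieces its bounds fall into, so frequently queried ranges end up finely partitioned while untouched ranges are never reorganized.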

Self-designing systems take this philosophy to its logical extreme: if a system can learn to optimize its storage layout, why not learn to optimize its entire architecture?

Stratos Idreos

Gordon McKay Professor of Computer Science · Harvard University

I am a Professor at Harvard's John A. Paulson School of Engineering and Applied Sciences and Faculty Co-Director of the Harvard Data Science Initiative. I lead DASlab, the Harvard Data and AI Systems Laboratory, where my research pursues a "grammar of data systems" that lets machines design and tune system architectures that are faster, more scalable, and tailored to their context.

Before Harvard, I was a researcher at CWI Amsterdam and earned my PhD from the University of Amsterdam. I have co-chaired ACM SIGMOD 2021 and IEEE ICDE 2022, co-founded the ACM/IMS Journal of Data Science, and currently serve as chair of the ACM SoCC Steering Committee.

CIDR Test-of-Time Award 2025
Sloan Research Fellowship 2023
Harvard McDonald Mentoring Award 2023
ACM SIGMOD Test-of-Time Award 2022
ACM SIGMOD Contributions Award 2020
IEEE TCDE Rising Star Award 2015
ACM SIGMOD Jim Gray Dissertation Award 2011
ERCIM Cor Baayen Award 2011
NSF CAREER Award
DOE Early Career Award