"A central governance failure of modern AI is epistemic: we lack shared, verifiable ways to know what went into systems and what that implies."
Over two days in Cambridge, we convened researchers, archivists, open-source builders, legal and policy experts, and cultural institutions to explore what it would take to build a consent-aware, provenance-rich substrate for public AI.
Many frontier AI systems are trained on vast corpora with unclear provenance, ambiguous consent, uneven cultural representation, and no shared canonical grounding. Legal and policy interventions lag behind technical reality, while private licensing solutions risk accelerating enclosure of the open web.
Human institutions historically solved analogous problems through libraries, archives, editorial systems, courts, peer review, and shared public memory. AI systems today lack a widely adopted equivalent. The AI Commons exists to prototype that missing layer.
By the end of Day 2, discussions consistently resolved into three interdependent workstreams. They are not sequential — each depends on the others.
Provenance is more than licensing — it includes existence, access conditions, versioning, and transformation history. Metadata itself may become the primary IP and governance artifact.
Registries must support human vs. AI-generated distinctions, training preferences and consent signals, third-party attestations, and federation rather than a single global database. Alignment with libraries, Europeana, DPLA, IIIF, and the Internet Archive is essential.
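As a concrete sketch of what such a registry entry might carry — field names here are hypothetical illustrations, not a proposed standard — the provenance and consent signals above could be modeled like this:

```python
from dataclasses import dataclass, field

@dataclass
class RegistryEntry:
    """Hypothetical federated-registry record; all field names are illustrative."""
    work_id: str                    # identifier within this registry node
    home_registry: str              # federation: which node is authoritative for this record
    origin: str                     # "human", "ai", or "mixed"
    access_conditions: str          # e.g. "public", "on-request", "restricted"
    version: int                    # versioning, not just existence
    transformation_history: list = field(default_factory=list)  # prior derivations
    training_preference: str = "unspecified"  # "opt-in", "opt-out", "conditional"
    attestations: list = field(default_factory=list)  # third-party claims about the record

entry = RegistryEntry(
    work_id="bk-001",
    home_registry="registry.example.org",
    origin="human",
    access_conditions="public",
    version=2,
    transformation_history=["ocr-cleanup-v1"],
    training_preference="conditional",
    attestations=[{"by": "library.example", "claim": "public-domain"}],
)
print(entry.training_preference)  # -> conditional
```

The point of the sketch is structural: consent, origin, access, and history are separate fields a federated node can answer for independently, rather than one opaque license string.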
Copyright law is insufficient to express moral authorship, consent, or downstream impact. AMPL and related approaches define AI-specific legal primitives — not just licenses.
The near-term path: start with public-domain material and lawful custodianship as the lowest-risk wedge; create legible options for creators (opt-in, opt-out, conditional participation); and separate restrictions on use from restrictions on sharing or training.
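One way to make that separation machine-legible — a minimal sketch, assuming signal names that are not any existing specification — is to keep each axis as its own field:

```python
# Hypothetical consent record: separate axes rather than one monolithic license.
consent = {
    "participation": "conditional",     # "opt-in" | "opt-out" | "conditional"
    "sharing": "allowed",               # may the work be redistributed?
    "training": "allowed-with-credit",  # may it enter training corpora, and under what terms?
    "use": "noncommercial-only",        # restrictions on downstream use
}

def permits_training(record: dict) -> bool:
    """A trainer checks the training axis without touching use/sharing terms."""
    return record["training"].startswith("allowed")

print(permits_training(consent))  # -> True
```

Because the axes are independent, a creator can, for example, allow sharing while forbidding training, something a single license field expresses poorly.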
Compute access is a bottleneck, but coordination and tooling are equally constraining. Small, high-quality, well-documented data can be disproportionately powerful. Evaluation infrastructure is a democratization lever.
Proposed pilots: a public-domain "Vintage" model (world knowledge constrained to roughly 1930, depending on jurisdiction), improved OCR pipelines for public-domain books and newspapers, and transparent training pipelines with data manifests and versioned releases.
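A data manifest for such a versioned release might look like the following minimal sketch (the keys and release name are illustrative assumptions, not an adopted format):

```python
import hashlib
import json

# Hypothetical manifest for a versioned public-domain training release.
manifest = {
    "release": "vintage-corpus-v0.1",
    "cutoff_year": 1930,          # world knowledge constrained by PD status
    "jurisdiction": "US",         # PD cutoff varies by jurisdiction
    "sources": [
        {
            "id": "bk-001",
            "sha256": hashlib.sha256(b"example source text").hexdigest(),
            "pipeline": ["ocr", "dehyphenate", "dedupe"],  # transformation history
        },
    ],
}
print(manifest["release"])  # -> vintage-corpus-v0.1
```

Hashing each source and recording its processing steps is what makes the pipeline auditable: anyone can verify that a given release was built from exactly the listed inputs.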
Attendees included researchers, archivists, legal scholars, open-source leaders, and public-interest technologists from across the ecosystem, participating in person at the Berkman Klein Center and Parsnip, and virtually.
The workshop closed by surfacing key decisions ahead. These are not rhetorical — they are the live questions the Commons coalition is actively working through.
Which pilot do we commit to first — and which do we explicitly defer?
What is the minimum viable registry that is genuinely useful, not merely extant?
Where can we merge with or reinforce existing efforts instead of duplicating them?
What constitutes a near-term "win" — technical, legal, or institutional?
How do we fund metadata and governance infrastructure sustainably?
How do we prevent recreating DRM under a different name — where is the line between transparency and control?
"Build things that matter even if we fail."
Participants repeatedly emphasized that success is not total victory — it is irreversibility. The goal is to reach a point where Commons infrastructure becomes hard to ignore and impossible to erase.
We are researchers, builders, archivists, and legal architects working to build the shared infrastructure layer that modern AI has been missing. Reach out if you want to contribute.
hello@aicommons.cc
aicommons.cc