This page is a synthesis artifact of the AI Commons December 2025 Kickoff · Cambridge, MA · Berkman Klein Center & Parsnip · Not a formal consensus statement
December 15–16, 2025 · Berkman Klein Center & Parsnip · Cambridge, MA

the ai commons

§ kickoff synthesis · a shared substrate for consent-aware, provenance-rich public ai
§ the gathering
  • 2 days of working sessions
  • 30+ researchers, builders, archivists & legal experts
  • 3 interdependent workstreams that emerged

"A central governance failure of modern AI is epistemic: we lack shared, verifiable ways to know what went into systems and what that implies."

Over two days in Cambridge, we convened researchers, archivists, open-source builders, legal and policy experts, and cultural institutions to explore what it would take to build a consent-aware, provenance-rich substrate for public AI.

Many frontier AI systems are trained on vast corpora with unclear provenance, ambiguous consent, uneven cultural representation, and no shared canonical grounding. Legal and policy interventions lag behind technical reality, while private licensing solutions risk accelerating enclosure of the open web.

Human institutions historically solved analogous problems through libraries, archives, editorial systems, courts, peer review, and shared public memory. AI systems today lack a widely adopted equivalent. The AI Commons exists to prototype that missing layer.

§ 01

what the ai commons is (and is not)

what it is
  • A modular infrastructure layer for lawful, culturally grounded AI development
  • A consent-aware, provenance-rich substrate combining legal tools, technical standards, registries, and institutions
  • Designed to work first and best for the public domain
  • An interoperability layer built with libraries and archives — not a replacement for them
  • A coalition of people architecting the foundations together
  • A cultural protocol as much as a technical one
what it is not
  • A single centralized repository or global database
  • A claim that one legal theory (fair use, TDM exceptions, licenses) will win
  • A closed consortium that gates participation by default
  • An attempt to outspend hyperscalers in frontier model training
  • An imposition of one governance model across every sector or jurisdiction
  • A system that asks institutions to trust invisible pipelines
§ 02

three pillars of convergence

By the end of Day 2, discussions consistently resolved into three interdependent workstreams. They are not sequential — each depends on the others.

§ pillar i
data & provenance standards
"Make existence, origin, and transformation legible."

Provenance is more than licensing — it includes existence, access conditions, versioning, and transformation history. Metadata itself may become the primary IP and governance artifact.

Registries must support human vs. AI-generated distinctions, training preferences and consent signals, third-party attestations, and federation rather than a single global database. Alignment with libraries, Europeana, DPLA, IIIF, and the Internet Archive is essential.

§ mvp shape
A federated registry supporting machine-readable declarations of origin, consent, and transformation — usable by both humans and training pipelines.
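As a concrete illustration of such a declaration, the sketch below shows what one registry entry might look like. All field names, values, and the schema itself are hypothetical assumptions for illustration, not a published Commons standard:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ProvenanceRecord:
    """One hypothetical federated-registry entry covering existence,
    origin, consent, and transformation history. Field names are
    illustrative, not a standardized schema."""
    identifier: str          # stable ID within a registry shard
    origin: str              # custodian institution or source collection
    human_generated: bool    # human- vs. AI-generated distinction
    consent_signal: str      # e.g. "opt-in", "opt-out", "conditional"
    access_conditions: str   # e.g. "public-domain", "on-site-only"
    transformations: list[str] = field(default_factory=list)  # e.g. OCR steps
    attestations: list[str] = field(default_factory=list)     # third-party attestations

    def to_json(self) -> str:
        # Machine-readable form usable by both humans and training pipelines
        return json.dumps(asdict(self), indent=2)

record = ProvenanceRecord(
    identifier="example-registry:0001",
    origin="Example Public Library, digitized newspaper collection",
    human_generated=True,
    consent_signal="opt-in",
    access_conditions="public-domain",
    transformations=["scan", "ocr", "dehyphenation"],
)
print(record.to_json())
```

Federation here would mean that many institutions publish records in a shared shape like this rather than depositing them into one global database.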
§ pillar ii
legal & policy infrastructure
"Build something that works even if the law changes."

Copyright law is insufficient to express moral authorship, consent, or downstream impact. AMPL and related approaches define AI-specific legal primitives — not just licenses.

This pillar starts with public-domain material and lawful custodianship as the lowest-risk wedge. It creates legible options for creators (opt-in, opt-out, or conditional participation) and separates restrictions on use from restrictions on sharing or training.
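The opt-in / opt-out / conditional distinction can be sketched as a simple training gate. This is a hedged illustration only; the signal names and the conservative default are assumptions, not a Commons specification:

```python
def may_train_on(consent: str, conditions_met: bool = False) -> bool:
    """Sketch of a per-record consent gate for a training pipeline.

    "opt-in"      -> usable for training
    "opt-out"     -> excluded from training (may still be shareable)
    "conditional" -> usable only if the declared conditions are satisfied

    Unknown or missing signals default to exclusion (conservative).
    """
    if consent == "opt-in":
        return True
    if consent == "conditional":
        return conditions_met
    return False  # "opt-out" and anything unrecognized
```

A pipeline would apply a gate like this per record, independently of whatever rules govern sharing or display, which is what keeps restrictions on use separate from restrictions on training.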

§ mvp shape
A hardened legal toolkit (not a single license) pairing provenance, consent signals, and institutional assurances — initially optimized for public-domain and library-held collections.
§ pillar iii
shared infrastructure & training pilots
"Demonstrate the stack end-to-end."

Compute access is a bottleneck, but coordination and tooling are equally constraining. Small, high-quality, well-documented data can be disproportionately powerful. Evaluation infrastructure is a democratization lever.

Candidate pilots include a public-domain "Vintage" model (world knowledge constrained to roughly 1930, with the exact cutoff depending on jurisdiction), improved OCR pipelines for public-domain books and newspapers, and transparent training pipelines with data manifests and versioned releases.

§ mvp shape
One or more fully documented, legally conservative training runs demonstrating how Commons infrastructure works in practice.
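A data manifest for such a documented training run could be as simple as a content-hashed file listing. The sketch below is a minimal illustration; the release tag, field names, and layout are assumptions, not an agreed Commons format:

```python
import hashlib
import json

def manifest_entry(name: str, content: bytes, source: str) -> dict:
    """One manifest row: file name, source, size, and content hash,
    so a training run can be audited and reproduced byte-for-byte."""
    return {
        "name": name,
        "source": source,
        "bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }

manifest = {
    "release": "vintage-pd-0.1",        # hypothetical versioned release tag
    "cutoff": "pre-1930 public domain",  # jurisdiction-dependent cutoff
    "files": [
        manifest_entry("gazette_1921.txt", b"...OCR text...", "Example State Archive"),
    ],
}
print(json.dumps(manifest, indent=2))
```

Publishing the manifest alongside each versioned release is what makes a "fully documented, legally conservative" run verifiable by third parties rather than a matter of trust.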
§ 03

core tensions that shaped the workshop

§ tension a
fair use vs. consent vs. legibility
While fair use may ultimately protect many AI training activities, institutions still need legibility, credible risk-reduction, and alignment with public values — not just legal minimums. The Commons does not assume one legal theory will prevail. It makes multiple lawful paths visible and auditable.
§ tension b
open participation vs. quality & safety
True openness risks fragmentation, duplication, and low-quality contributions. The Commons counters this through modularity over monoliths, clear contribution surfaces with different expertise thresholds, and shared evaluation and metadata infrastructure to keep quality high without centralization.
§ tension c
technical possibility vs. power concentration
Training an LLM from public-domain material is technically feasible today. The harder problem is preventing the hourglass shape: many contributors → few trainers → universal downstream dependence. The Commons is deliberately designed to counter concentration dynamics by broadening who can contribute, verify, and reproduce training in the open.
§ 04

who was there

Researchers, archivists, legal scholars, open-source leaders, and public-interest technologists from across the ecosystem — attending in person at Berkman Klein Center and Parsnip, and virtually.

MIT Media Lab Bayspring Cosmos Institute Creative Commons EleutherAI Tidelift The Lean Startup Anaconda Public Knowledge Advanced AI Society Harvard BKC IBM Funding the Commons Emergent Research MIT (CSAIL) Public AI Italy OpenMined Collaborative Futures Institute Prime Intellect Model Corp AI Common Crawl Wikimedia Foundation Harvard Library Innovation Lab Open Source Initiative Air Signal Santa Fe Institute (fmr.)
§ 05

cultural & global commitments

  • Multilinguality and multicultural representation are core, not optional. Historical gaps in the record must be actively addressed, not passively inherited.
  • Libraries and archives are not "data sources" — they are institutional stewards. The Commons is built with them, not around them.
  • Commons participation must offer real agency to those without bargaining power. This means legible opt-in, opt-out, and conditional participation.
  • This must be emotionally compelling — not just correct. Fun, rituals, visibility, and recognition matter. The Commons is as much a cultural protocol as a technical one.
  • No single legal entity can hold this alone. The emerging structure favors resilience over purity: a nonprofit core, public-benefit entities, and a federated network of partners.
  • AI should not depend on trust without evidence. Consent must be operational, provenance verifiable, and accountability enforceable.
§ 06

what success looks like

"Build things that matter even if we fail."

Participants repeatedly emphasized that success is not total victory — it is irreversibility. The goal is to reach a point where Commons infrastructure becomes hard to ignore, and impossible to erase.

§ near-term markers of success
  • A public artifact or service that institutions genuinely rely on
  • A training run others can reproduce, critique, and extend
  • A legal or provenance primitive that becomes a reference point
  • A coalition that keeps meeting, building, and converging
§ get involved

join the commons

We are researchers, builders, archivists, and legal architects working to build the shared infrastructure layer that modern AI has been missing. Reach out if you want to contribute.

hello@aicommons.cc

aicommons.cc