AI Governance

Datasheets for Datasets

A documentation framework for AI training datasets that describes their composition, collection methodology, preprocessing steps, intended uses, and ethical considerations. Proposed by Gebru et al. in 2018, datasheets bring supply-chain transparency to the data that shapes AI behavior.

Why It Matters

Garbage in, garbage out; worse, biased data in means biased decisions out. Datasheets force data creators to surface the assumptions, gaps, and potential harms baked into training data before that data shapes a model.

Example

A datasheet for a medical imaging dataset would document the demographic breakdown of patients (age, sex, ethnicity), the hospitals where images were collected, whether informed consent was obtained, and known gaps like underrepresentation of pediatric cases.
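One way to picture such a datasheet is as a structured record that can be checked programmatically. The sketch below is a simplified illustration, not the format proposed by Gebru et al., whose datasheets are free-text answers to a questionnaire. The `Datasheet` class, its field names, the `underrepresented` helper, the 10% threshold, and the sample counts are all assumptions made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Datasheet:
    """Hypothetical, simplified datasheet record; field names are illustrative."""

    name: str
    collection_sites: list[str]
    informed_consent: bool
    # Record counts per group, e.g. {"age_group": {"adult": 950, "pediatric": 50}}
    demographics: dict[str, dict[str, int]] = field(default_factory=dict)
    known_gaps: list[str] = field(default_factory=list)

    def underrepresented(self, attribute: str, min_share: float = 0.1) -> list[str]:
        """Return groups whose share of records for `attribute` is below `min_share`."""
        counts = self.demographics.get(attribute, {})
        total = sum(counts.values())
        if total == 0:
            return []  # nothing documented for this attribute
        return [group for group, n in counts.items() if n / total < min_share]


# Made-up values for illustration only; not drawn from any real dataset.
sheet = Datasheet(
    name="example-medical-imaging",
    collection_sites=["Hospital A", "Hospital B"],
    informed_consent=True,
    demographics={"age_group": {"adult": 950, "pediatric": 50}},
    known_gaps=["Pediatric cases are underrepresented"],
)

print(sheet.underrepresented("age_group"))
```

With these sample counts, pediatric cases make up 5% of records and fall below the assumed 10% threshold, so the helper flags them. A real datasheet would record far more context (collection process, preprocessing, intended uses) than a record like this can hold.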

Think of it like...

Datasheets are like ingredient lists and sourcing disclosures on food packaging: they tell you where the raw materials came from and how they were handled before reaching your plate.

Related Terms