Investigating AI datasets: A journalist's guide

Kathleen Siminyu, Christo Buschek

Bonn: Deutsche Welle DW Akademie (2025)

"Algorithms today are more complex than ever. With the rise of large language models (LLMs), they have become increasingly unexplainable and non-deterministic. Even experts often fail to understand why an algorithm yields a particular outcome. It is through the datasets used to train those algorithms that we start to understand them. Because when we look closer at the datasets used to train these incredibly complex machines, we recognize the models that they power and learn about the emerging effects of algorithmic systems. We all have interacted with algorithmic systems in our lives many times over. And many of you, I'm sure, have used a generative language model at some point, either for your work or privately. These machines are indeed impressive. We can't help but be dazzled. But are they working the way we expect them to? And in what context do they work? For whom? In this course we want to touch on the following topics: What is the importance of datasets on algorithmic systems? How are datasets, used to train AI, constructed? What can we learn from dataset curation when we work with data ourselves? What are some of the biases that we find in datasets? (Publisher description)

Introductory video: Why dataset investigation matters -- What is data, what are datasets? -- AI, ML, LLM: Key terms -- Datasets in machine learning -- Dataset curation and bias -- Deepdive: Languages in datasets -- Deepdive: LAION-5B -- Working with data