About this knowledge base

January 23, 2021 - 3 minute read - Category: Overview - Tags: Overview

This knowledge base shares the information and references compiled for the PhD course Unleashing Novel Data at Scale (spring 2023). I first taught this course in 2021 as a major pandemic project, born from the observation that many important social science questions remain unanswered, in substantial part because the data required to examine them has traditionally been inaccessible. Sometimes data is inaccessible because it is trapped in hard copy document scans or other types of image data; other times, information is scattered throughout vast reams of text. Processing unstructured data is particularly challenging for lower-resource settings, meaning that economics oftentimes doesn’t fully explore the diversity of human societies. Recent advances in deep learning offer enormous potential to convert raw unstructured information into computable data at scale, including for low-resource settings.

This knowledge base consists of material from 18 lectures covering recent advances in computer vision and natural language processing, applied to social science data curation. The course was completely redesigned in 2023, and many of the covered papers were published after it was last offered in 2021.

The deep learning literature is vast and moves very quickly. I initially found it quite overwhelming to find the contributions that were both the most powerful and the most relevant to social science. The main contribution of the knowledge base is to provide links to the various academic papers, online course notes/videos, and other online resources that I’ve found most useful for understanding how deep learning can be used to convert raw information into meaningful, structured data for social science research. Links to my lecture slides and videos are also included. I also intermingle some practical advice on applying these methods, based on the experience of my research group, including our efforts to develop the Layout Parser library and our application of computer vision and NLP methods to tens of millions of page scans of historical documents.

Some of the material covered in the knowledge base - like observations from my own research experience - is unique to this source, whereas in other cases I’m summarizing papers and methods for which there are a large number of online resources, some of which teach the material far better than I possibly could, with tons of visual animations and the like. All of these references are linked and cited in my slides/lectures, and I try to make the knowledge base a reasonably easy-to-use buffet, where you can click through to whatever resources seem most relevant and helpful. The course is about the concepts, not a how-to of coding in Python or PyTorch, though some references with relevance to coding are included.

I personally found it really time-consuming to delve into the literature and figure out what was relevant to me as a social science researcher, since very few resources are tailored towards social science applications. I hope that the information compiled in this knowledge base will make it at least a little bit easier for others to navigate and follow this literature. The literature moves quickly, and I will do my best to add additional posts as our research group gains new insights from our work applying deep learning methods to social science problems.