Social science research often relies on scans of documents such as statistical tables, newspapers, and firm-level reports.
Unfortunately, OCR is not designed to detect document layouts except in cases where layouts are extremely simple. The figures below show typical OCR bounding boxes: much of the text is not detected, and some is detected twice or scrambled. Moreover, OCR cannot distinguish different text types, e.g., headlines vs. captions vs. article text. This means OCR alone cannot power the end-to-end conversion of document image scans into structured databases.
We have released Layout Parser, an open-source, deep-learning-powered library that provides a variety of tools for automatically processing document image data at scale. Webpage; arXiv; GitHub
Contrast the off-the-shelf OCR output with the layout detection results we achieve through Layout Parser's deep-learning-powered, full document image analysis (DIA) pipelines. The colors of the bounding boxes denote the different types of text regions automatically classified by our DIA pipelines. Automatic classification of meaningful text regions is required to automate the conversion of raw document scans into structured databases.
We are currently using Layout Parser to process tens of millions of such documents.
Layout Parser is not just for English. Here's another example: a complex historical table from Japan.
Layout Parser provides the following functionalities (community platform under construction):
Layout Parser currently ships with a small set of pre-trained models, and the pipelines used for the examples above will be integrated once finalized. We are working to expand the types of documents it can process off the shelf.
With Layout Parser, you can train your own customized DL-based layout models. Because our pre-trained model zoo is still small, Layout Parser is currently most useful for designing customized models, with the pre-trained models providing a useful starting point via transfer learning.
Don’t have labeled data? Layout Parser includes a data annotation toolkit that makes creating labeled data more efficient.
Among its functionalities is a perturbation-based scoring method that selects the most informative samples to label.
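To give a flavor of how such a score can work (this is an illustrative sketch, not Layout Parser's exact implementation): perturb each unlabeled image slightly, re-run the model, and score the image by how much its predictions change. Unstable predictions suggest the model is uncertain there, so the image is informative to label. The toy model and image pool below are invented for illustration.

```python
import random

def perturbation_score(image, model, n_perturb=5, noise=0.02, seed=0):
    # Score = average change in the model's predictions under
    # small random pixel noise (illustrative sketch only).
    rng = random.Random(seed)
    base = model(image)
    total = 0.0
    for _ in range(n_perturb):
        perturbed = [px + rng.gauss(0, noise) for px in image]
        total += sum(abs(a - b) for a, b in zip(model(perturbed), base))
    return total / n_perturb

# Toy "model": thresholds pixel values. Pixels near the 0.5 threshold
# flip under noise, so the model is unstable (uncertain) on them.
def toy_model(image):
    return [1.0 if px > 0.5 else 0.0 for px in image]

pool = {"easy": [0.10, 0.90, 0.05], "hard": [0.49, 0.51, 0.50]}
scores = {name: perturbation_score(img, toy_model) for name, img in pool.items()}
```

In an active-learning loop, you would sort the unlabeled pool by this score and send the highest-scoring pages to annotators first.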
Layout Parser provides wrappers for calling OCR engines and ships with a customizable CNN-RNN OCR model.
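The wrapper idea can be sketched as a thin adapter interface: each engine implements the same detect method, so downstream code does not care which backend produced the text. The class names below are illustrative stand-ins, not Layout Parser's exact API, and the toy agent merely simulates a recognition backend.

```python
from abc import ABC, abstractmethod

class OCRAgent(ABC):
    """Common interface an OCR-engine wrapper exposes (illustrative)."""

    @abstractmethod
    def detect(self, image) -> str:
        """Return the text recognized in `image`."""

class EchoAgent(OCRAgent):
    # Toy engine standing in for a real backend: it "recognizes"
    # whatever string it is given, uppercased so the effect is visible.
    def detect(self, image) -> str:
        return str(image).upper()

def extract_text(agent: OCRAgent, images) -> list:
    # Downstream code is engine-agnostic: agents can be swapped freely.
    return [agent.detect(img) for img in images]

texts = extract_text(EchoAgent(), ["page one", "page two"])
```

The benefit of the adapter design is that a pipeline written against the common interface keeps working when the OCR backend changes.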
Layout Parser provides a flexible output structure to facilitate diverse downstream analyses.
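As an illustration of what such an output structure can look like, here is a minimal, self-contained sketch that mimics the kind of layout objects a detection pipeline returns. The class and field names are simplified stand-ins, not the library's exact API, and the example blocks are invented.

```python
from dataclasses import dataclass, asdict

@dataclass
class TextBlock:
    # Simplified stand-in for one detected layout region:
    # pixel coordinates, predicted region type, confidence, OCR'd text.
    x1: float
    y1: float
    x2: float
    y2: float
    type: str
    score: float
    text: str = ""

# A "layout" is an ordered collection of blocks, which makes
# downstream filtering, sorting, and tabulation straightforward.
layout = [
    TextBlock(10, 300, 200, 340, "title",   0.98, "Annual Report"),
    TextBlock(10, 100, 200, 280, "table",   0.91, ""),
    TextBlock(10,  20, 200,  90, "caption", 0.87, "Table 1: Revenues"),
]

# Keep only high-confidence table regions, sorted top-to-bottom.
tables = sorted(
    (b for b in layout if b.type == "table" and b.score > 0.8),
    key=lambda b: -b.y1,
)

# Flatten to rows ready for a DataFrame, CSV, or database table.
rows = [asdict(b) for b in layout]
```

Because each block is a plain record, the same layout can feed a spreadsheet, a database loader, or a plotting routine without format conversion.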
Layout Parser exposes simple APIs and can perform off-the-shelf layout analysis in four lines of Python code.
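The canonical usage looks like the sketch below. The image filename and model path are illustrative; running this for real requires installing the layoutparser package with a Detectron2 backend, and the model weights are downloaded on first use, so the snippet degrades gracefully when those dependencies are absent.

```python
# Hedged sketch of off-the-shelf layout detection with Layout Parser.
# "scan.png" and the PubLayNet model path are illustrative examples.
try:
    import cv2
    import layoutparser as lp

    image = cv2.imread("scan.png")
    model = lp.Detectron2LayoutModel("lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config")
    layout = model.detect(image)
except Exception:
    layout = None  # dependencies, weights, or input image not available
```

The four core lines are the imports, the model construction, and the single `detect` call; everything else here is guard code for missing dependencies.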
No background in deep learning? See the knowledge base section of this site for lecture videos from my course on deep learning for data curation.