Lazy is too hard. When I had 40,000 PDFs and needed to extract their data, I knew that the "lazy" approach was insufficient. This talk reviews tools to tame PDFs with confidence. I'll use my open-data project's workflow as an example (ETL, anyone?). It's also a follow-up/response to PyOhio 2016's "We Don't Need No Stinkin' PDF Library: Build PDFs with Python the Lazy Way".
Imagine that you have access to 40,000 PDFs. How might you make this data useful, searchable, or maybe even friendly? While lazy has its place, this challenge needs a robust solution. Several Python-accessible solutions exist. I'll tell the story of how I made this process work for me and discuss the tools I used to make the data friendly.
I'll focus on extracting data mostly with the following tools: Apache Tika, ReportLab, and Apache PDFBox.
Journalists used Apache Tika to gather data for the Pulitzer Prize-winning Panama Papers investigation by the International Consortium of Investigative Journalists in 2016. Tika focuses on extracting content not just from PDFs but from other common document formats as well. It's a great tool for most simple jobs, like a full-text dump.
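For a sense of how small that job can be, here is a minimal sketch of a full-text dump using the tika-python bindings (assuming the tika package is installed and sample.pdf is a placeholder file name):

    # Minimal full-text dump with tika-python (assumes `pip install tika` and a
    # Java runtime available so the Tika server can start).
    from tika import parser

    parsed = parser.from_file("sample.pdf")  # "sample.pdf" is a placeholder name
    print(parsed["metadata"])                # metadata Tika recovered from the file
    print(parsed["content"])                 # plain-text body of the document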
ReportLab and PDFBox allow users to create, manipulate, and extract data from PDF documents. They are useful when the data needs to be parsed or fed into a data pipeline. I'll review my process for parsing data from a complex PDF with PDFBox via Jython.
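The talk walks through my actual pipeline; as a hedged sketch of the basic pattern, here is what calling PDFBox from Jython can look like (assuming PDFBox 2.x jars on the classpath and report.pdf as a placeholder file name):

    # Sketch of plain-text extraction with PDFBox from Jython
    # (assumes PDFBox 2.x on the classpath; "report.pdf" is hypothetical).
    from java.io import File
    from org.apache.pdfbox.pdmodel import PDDocument
    from org.apache.pdfbox.text import PDFTextStripper

    doc = PDDocument.load(File("report.pdf"))
    try:
        stripper = PDFTextStripper()
        stripper.setStartPage(1)   # limit extraction to the first page
        stripper.setEndPage(1)
        print(stripper.getText(doc))
    finally:
        doc.close()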
I'll touch on the PDF specification itself during the talk. A PDF can include a huge variety of data, including text, images, and binary files, and it can also be encrypted. Once a user learns the basic aspects of the specification, working with it becomes much clearer. This includes identifying text objects and using spatial boxes to extract specific data from a page, as sketched below.
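One way to act on those spatial boxes is PDFBox's PDFTextStripperByArea. The sketch below again assumes PDFBox 2.x via Jython; the region name and rectangle coordinates (in points) are made up purely for illustration:

    # Sketch of region-based extraction with PDFBox via Jython (PDFBox 2.x assumed;
    # the rectangle values are illustrative, not taken from a real layout).
    from java.io import File
    from java.awt.geom import Rectangle2D
    from org.apache.pdfbox.pdmodel import PDDocument
    from org.apache.pdfbox.text import PDFTextStripperByArea

    doc = PDDocument.load(File("report.pdf"))
    try:
        stripper = PDFTextStripperByArea()
        # Name a spatial box covering, say, a form field near the top of the page.
        stripper.addRegion("field_of_interest", Rectangle2D.Float(400, 50, 150, 20))
        stripper.extractRegions(doc.getPage(0))   # pull text from the first page only
        print(stripper.getTextForRegion("field_of_interest"))
    finally:
        doc.close()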