Sunday 1:30 p.m.–2:20 p.m.

Building a world-class document pipeline using Python

Andrew Wolfe

Audience level:


At Brokersavant, we process large quantities of real estate assets, ranging from commercial property flyers to large real estate leases, and our customers expect a lightning-fast turnaround. Learn how we leveraged open-source technologies and Python libraries to create a system that scales to millions of assets per day without missing a beat.


Processing documents isn't just about loading them with file() and extracting the text straight from the document. Bad scans, images, misspellings, foreign languages, hundreds of document and image types, and other obstacles prevent us from taking the easy route when processing the document assets our software systems require. In this talk, we'll dive into some practices I've learned from solving real-world problems processing documents such as leases, flyers, and real estate comparison sheets from various global corporations and Fortune 100 companies at scale. We will discuss the following topics, which will help take your document processing to the next level:

  • Creating an asset pipeline using Celery, Redis, Docker, and Amazon S3
  • Using Elastic and NLTK to extract meaning from the document and make decisions within your pipeline
  • Protecting against edge cases in our pipeline
  • Using open-source technologies to convert documents into a standard input type
  • Creating a smarter engine using sklearn
  • Turning errors into improvements
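
To make the shape of such a pipeline concrete, here is a minimal sketch using only the Python standard library. It is not the Brokersavant implementation: the Asset class, the stage functions, and run_pipeline are hypothetical names chosen for illustration, the keyword check merely stands in for an NLTK or sklearn model, and in a real deployment each stage would more likely run as a Celery task than as a plain function call.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    """A document moving through the pipeline (hypothetical structure)."""
    name: str
    raw: bytes
    text: str = ""
    label: str = ""
    errors: List[str] = field(default_factory=list)


def decode_text(asset: Asset) -> Asset:
    """Standardize input: try UTF-8 first, then fall back to Latin-1."""
    try:
        asset.text = asset.raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        asset.text = asset.raw.decode("latin-1")
    return asset


def reject_empty(asset: Asset) -> Asset:
    """Edge-case guard: treat a blank document (e.g. a bad scan) as an error."""
    if not asset.text.strip():
        raise ValueError("no extractable text")
    return asset


def classify(asset: Asset) -> Asset:
    """Toy keyword classifier standing in for a trained model (assumption)."""
    words = asset.text.lower().split()
    asset.label = "lease" if "lease" in words or "tenant" in words else "unknown"
    return asset


STAGES = [decode_text, reject_empty, classify]


def run_pipeline(assets, stages=STAGES):
    """Run every asset through every stage, collecting failures for review."""
    done, failed = [], []
    for asset in assets:
        try:
            for stage in stages:
                asset = stage(asset)
            done.append(asset)
        except Exception as exc:
            # Record which stage failed so the error can drive a later fix.
            asset.errors.append(f"{stage.__name__}: {exc}")
            failed.append(asset)
    return done, failed


if __name__ == "__main__":
    done, failed = run_pipeline([
        Asset("a.txt", b"Lease between tenant and owner"),
        Asset("b.txt", b"   "),
    ])
    for asset in done:
        print(asset.name, asset.label)
    for asset in failed:
        print(asset.name, asset.errors)
```

Keeping failed assets, together with the name of the stage that rejected them, is one simple way to turn errors into improvements: the failure list becomes a backlog of edge cases to handle.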