Rapid prototyping in data science often hits a wall when data is too large to fit into memory. When this happens, teams are confronted with two options: sampling techniques or porting to Apache Spark. Both have significant drawbacks. In this talk, I'll demonstrate how to leverage Dask and Scikit-learn to solve this problem.
Rapid prototyping is the hallmark of many data science projects, where ideation, testing, and iteration are the norm. It is through this rapid, iterative process that robust solutions are identified. However, there often comes a point when projects stall because increasingly sophisticated prototypes require volumes of data that simply won't fit into memory. Teams typically take one of two approaches when this happens: sampling techniques or porting to Apache Spark. Both methods have significant drawbacks. By using out-of-core computation and an online machine learning approach, rapid prototyping can be extended, leading to more robust solutions in far less time. In this talk, I'll demonstrate how to leverage Dask and Scikit-learn to solve this problem.
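To make the idea concrete, here is a minimal sketch of the kind of pattern the talk covers: reading a larger-than-memory dataset lazily with Dask and streaming it, one partition at a time, through a scikit-learn estimator that supports partial_fit. The file path, column names, and class labels below are hypothetical placeholders, not part of any specific project.

```python
# Minimal out-of-core / online-learning sketch with Dask + scikit-learn.
# The CSV glob, the "label" column, and the class list are assumed examples.
import dask.dataframe as dd
from sklearn.linear_model import SGDClassifier

# Point Dask at the dataset lazily; nothing is loaded into memory yet.
ddf = dd.read_csv("data/events-*.csv")

clf = SGDClassifier()   # any estimator exposing partial_fit works the same way
classes = [0, 1]        # partial_fit needs the full set of classes up front

# Stream one partition (a plain pandas DataFrame) at a time through the model,
# so only a single chunk ever occupies memory.
for part in ddf.to_delayed():
    chunk = part.compute()
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```

The same loop works with any incremental learner, which is what lets the prototype keep growing with the data instead of being rewritten for a cluster framework.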