Large datasets are vital for the majority of analytic and machine learning tasks. But what happens when the data you need isn't available in a convenient, easily obtainable form? This talk walks through the process of data scraping to create a dataset that can then be used for various analytical or machine learning tasks.
A large, high-quality dataset is essential for meaningful analytics and for most machine learning tasks. For some tasks, there exist simple APIs or public data repositories to collect from. But for many others, such as tracking product prices, predicting stock prices, or predicting the outcomes of sports games, there is no convenient way to retrieve the information other than a webpage. In these circumstances, learning to scrape data from webpages and other sources lets us build our own datasets. Additionally, scraping grants us the ability to ask better questions about data in the world.
This talk is geared towards beginner-to-intermediate Python developers who want to ask and answer better questions through data. It provides a guide to web scraping through two examples and explains how to get the scraped data into a usable form. Throughout the talk, I will highlight tips for improving scraper performance and minimizing the risk that a web server will block you, as well as different ways to store the collected data. The first of the two examples examines a simple case of scraping lottery data, and the second explores a more challenging case of scraping course information from a university.
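To give a flavor of what "getting scraped data into a usable form" looks like, here is a minimal sketch that parses an HTML table into a list of records using only the Python standard library. The HTML snippet and the field names (`date`, `numbers`) are illustrative assumptions, not material from the talk itself; a real scraper would fetch the page over HTTP and would typically use a library such as Beautiful Soup.

```python
# Minimal sketch: turning scraped HTML into structured records, stdlib only.
# The sample HTML below is a made-up stand-in for a lottery results page.
from html.parser import HTMLParser

SAMPLE = """
<table>
  <tr><td>2024-01-03</td><td>7 12 23 31 40</td></tr>
  <tr><td>2024-01-06</td><td>2 9 18 27 44</td></tr>
</table>
"""

class LotteryTableParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows, each a list of cell strings
        self._row = None      # cells of the row currently being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

parser = LotteryTableParser()
parser.feed(SAMPLE)
# Reshape raw rows into dictionaries ready for analysis or storage.
records = [{"date": d, "numbers": n.split()} for d, n in parser.rows]
print(records[0])
```

From here the records could be written to CSV, JSON, or a database, which is the kind of storage trade-off the talk discusses.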