Python and, more specifically, Jupyter Notebooks enable data analysis workflows to be reproducible with very little overhead. This talk will discuss how to use Jupyter Notebooks and the Python ecosystem to incrementally improve the reproducibility, efficiency, and depth of your existing analysis workflow.
The ability to clean, manipulate, and visualize data is a skill that is increasingly in demand due to the abundance of data available. There are many methods of performing data analysis, each with its own benefits and drawbacks. Of the many features that can be used to evaluate a data analysis method, we will focus on ease of use and reproducibility.
It is crucially important for data analysis to be reproducible from the starting data. A reproducible workflow allows others to review your work and can help catch assumptions, bugs, and steps that are not transparent in the final result. Reproducibility improves the transparency of your analysis and therefore makes your results more resistant to misinterpretation.
Python and, more specifically, Jupyter Notebooks enable workflows to be reproducible with very little overhead. Due to these characteristics, Jupyter Notebooks have become a standard medium for much data science work.
One of the largest obstacles to switching to a reproducible workflow is the amount of time you need to sink into the conversion. I will demonstrate how an existing data analysis workflow can be incrementally modified using Python libraries and Jupyter Notebooks.
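As a minimal sketch of the kind of incremental change involved (the function names and data here are hypothetical, not from the talk itself), one early step is to replace ad hoc cell-by-cell manipulation with small pure functions of the raw data, so that re-running a notebook top to bottom reproduces every result:

```python
import statistics

# Hypothetical illustration: each analysis step is a pure function of the
# raw data, so the notebook can always be re-run from the starting data.
def clean(rows):
    # Drop records with missing measurements and coerce values to float.
    return [float(r["value"]) for r in rows if r["value"] not in (None, "")]

def summarize(values):
    # Reduce the cleaned values to the quantities reported in the analysis.
    return {"n": len(values), "mean": statistics.mean(values)}

raw = [{"value": "1.5"}, {"value": ""}, {"value": "2.5"}]
result = summarize(clean(raw))
# result == {"n": 2, "mean": 2.0}
```

Because no step mutates hidden state, the same raw input always yields the same summary, which is the property a reviewer needs in order to check the work.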
I'm a physics graduate student at The Ohio State University. I work in a biophysics research group that produces and analyzes a lot of spectroscopic data. One major part of the research our group performs is the analysis of our measurements. I've taken on the task of updating our data analysis workflow to one that is more reproducible and automated. My approach has been to produce a pure Python workflow using Jupyter Notebooks and a library I have written to streamline typical tasks. This project has required me to learn a lot about how to do data analysis well. I didn't have the opportunity to take any classes on large-scale data analysis in my studies; instead, most of what I've learned comes from excellent resources scattered across academic literature and blog posts by data science professionals. I currently use Python for my entire data analysis workflow and am working to help others move their analyses to this style. I'd like to give back to the Python community by sharing what I've learned so far.