- Outline the steps of a data science analysis
- Understand the software requirements for the course
6 January 2021
This content draws upon material developed by Jeff Leek & Roger Peng for their Advanced Data Science Course at the Johns Hopkins Bloomberg School of Public Health.
Make your analysis reproducible!
What makes a good question?
Data sources (in order of confidence):
This is usually the most time-intensive step
It’s also the most important & the hardest to teach
Do the data meet your expectations?
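One way to make "Do the data meet your expectations?" concrete is to write the expectations down as explicit checks and run them before any analysis. The following is only a minimal sketch in Python (the course itself uses R). The sample data, the column names, the allowed ranges, and the `check_expectations` helper are all hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file.
RAW_CSV = """id,age,height_cm
1,34,172.5
2,29,
3,-4,165.0
"""


def check_expectations(rows, expected_columns, ranges):
    """Return a list of human-readable problems found in the rows.

    rows: list of dicts (one per record), values are strings.
    expected_columns: column names every row should contain.
    ranges: maps a column name to an inclusive (low, high) range.
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        missing_cols = [c for c in expected_columns if c not in row]
        if missing_cols:
            problems.append(f"row {i}: missing columns {missing_cols}")
            continue
        for col, (low, high) in ranges.items():
            value = row[col]
            if value is None or value.strip() == "":
                problems.append(f"row {i}: {col} is empty")
                continue
            try:
                number = float(value)
            except ValueError:
                problems.append(f"row {i}: {col} is not numeric ({value!r})")
                continue
            if not low <= number <= high:
                problems.append(
                    f"row {i}: {col}={number} outside [{low}, {high}]"
                )
    return problems


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
problems = check_expectations(
    rows,
    expected_columns=["id", "age", "height_cm"],
    ranges={"age": (0, 120), "height_cm": (50, 250)},  # assumed plausible ranges
)
for problem in problems:
    print(problem)
```

With the sample data above, the check flags the empty height in the second row and the negative age in the third. Collecting problems in a list, rather than stopping at the first failure, gives a full picture of how far the data depart from expectations.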
Where to start?
Who’s your audience? What’s your medium?
Consider these real scenarios
The goal is to explore the data
This is your “sandbox”
Soften the edges of the raw results
A refined analysis and presentation
Here are some useful questions to ask yourself
R is free
R is well supported by its user community
Most of R’s current capabilities come from packages
Git records changes to a file or set of files over time so that you can recall specific versions later
Notebook interface to weave together text, equations, and code into nicely formatted output
Allows you to create and document fully reproducible workflows
Supports dozens of static and dynamic output formats
We’ll learn about research compendia and style conventions for coding and naming