4 January 2021

WHO AM I?

Mark Scheuerell

  • Asst Unit Leader, USGS WACFWRU

  • Assoc Professor, School of Aquatic and Fishery Sciences

My role as course instructor

  • Help you learn the material
  • Help you learn how to ask for help
  • Be a future resource

My role as course instructor

  • I plan to fail
  • A lot
  • In public

WHO ARE YOU?

Introduce yourself via Zoom chat:

  1. Your degree program (undergrad, grad)

  2. Your school/department

  3. Your major (undergrads) or area of study (grads)


For example,

undergrad, SEFS, ESRM

grad, SAFS, salmon migration timing

What is data science?

Data science is…

  • statistics on a Mac
  • 80% preparing data, 20% complaining about preparing data
  • 90% Googling and 10% pasting from Stack Overflow

Data science motto

If at first you don’t succeed, call it version 1.0

Data science is interdisciplinary

  • Math and statistics

  • Computer science

  • Knowledge about the system

  • Communication

Data science process

This should all be reproducible

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results

Data science time

  1. Define the question of interest
  2. Get the data
  3. Clean the data
  4. Explore the data
  5. Fit statistical models
  6. Communicate the results

Making this reproducible

 

Good data science is distinguished from bad data science primarily by a repeatable, thoughtful, skeptical application of an analytic process to data in order to arrive at supportable conclusions.

– Jeff Leek

Questions in data science

Leek & Peng (2015) Science

Descriptive

Summarizes the information in a single data set without further interpretation

ex: the US Census

Exploratory

Searches for trends, correlations, etc. among multiple variables

ex: “data mining”

Inferential

Generalizes to the population from a sample

ex: wearing a mask reduces the spread of COVID
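
As a minimal sketch of what "generalizes to the population from a sample" can look like in code, the Python snippet below estimates a population mean from a random sample and attaches an approximate 95% confidence interval. The fish-length "population" is simulated, and the names and numbers are illustrative assumptions, not course data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "population": 10,000 simulated fish lengths (mm); made-up values
population = [random.gauss(500, 50) for _ in range(10_000)]

# In practice we only observe a sample, here 100 randomly chosen fish
sample = random.sample(population, 100)

# Point estimate and its standard error, computed from the sample alone
sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval using the normal critical value 1.96
ci_lower = sample_mean - 1.96 * std_error
ci_upper = sample_mean + 1.96 * std_error

print(f"sample mean: {sample_mean:.1f} mm")
print(f"approx. 95% CI: ({ci_lower:.1f}, {ci_upper:.1f}) mm")
```

The interval is a statement about the unobserved population mean, which is what separates an inferential question from a purely descriptive summary of the sample.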

Predictive

Uses a subset of the data to parameterize a model and predicts out-of-sample data

ex: using weather to predict recreational fishing effort
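
The predictive idea above can be sketched in a few lines of Python: fit a model on one subset of the data, then score its predictions on observations it never saw. The temperature and fishing-effort values are simulated, and the straight-line model and 75/25 split are assumptions chosen only to keep the example short.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: daily air temperature (deg C) and angler trips; simulated values
temps = [random.uniform(5, 30) for _ in range(200)]
effort = [20 + 3 * t + random.gauss(0, 10) for t in temps]

# Hold out the last 25% of the observations as out-of-sample data
n_train = 150
x_train, y_train = temps[:n_train], effort[:n_train]
x_test, y_test = temps[n_train:], effort[n_train:]

# Fit a least-squares line using only the training subset
mean_x = sum(x_train) / n_train
mean_y = sum(y_train) / n_train
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
sxx = sum((x - mean_x) ** 2 for x in x_train)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Predict the held-out observations and measure the prediction error
predictions = [intercept + slope * x for x in x_test]
rmse = (sum((p - y) ** 2 for p, y in zip(predictions, y_test)) / len(y_test)) ** 0.5

print(f"fitted line: effort = {intercept:.1f} + {slope:.2f} * temp")
print(f"out-of-sample RMSE: {rmse:.1f} trips")
```

The error is computed only on the held-out rows, so it measures how well the model predicts new data rather than how well it fits the data used to estimate it.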

Causal

Seeks to understand the average magnitude and direction of a response to a change in another variable

ex: experimental analysis of temperature effects on oyster growth

Mechanistic

Seeks to understand the specific magnitude and direction of relationships

ex: engineering studies

Storytelling in data science

Communicating an analysis

Although you may have spent weeks/months on an analysis, most people only want the bottom line

The key is weaving everything into a nice story to achieve both trust and belief

Trust

Do you accept the analysis per se?

Were the data analyzed properly and thoroughly?


Trust is specific to the analysis/analyst (internal)

Belief

Do you believe the analysis?


Belief is more broadly related to previous work and other factors (external)

3 parts to an analysis

  1. The things you did and presented

  2. The things you did but did not present

  3. The things you did not do

The trick is to strike a balance to achieve trust and belief

Things we’ll emphasize

  • Thinking about data separately from research
  • Increasing efficiency & reproducibility
  • Facilitating collaboration with others
  • Finding answers from the community
  • Communicating our workflows & results

Things we won’t emphasize

  • Specific options for cleaning & tidying data (e.g., tidyverse)
  • Specific analytical techniques (e.g., machine learning)
  • Specific plotting techniques (e.g., ggplot)

How are we going to do this?

  • Use lots of online resources

  • Use open/live coding

  • Rely on our community for help