Statistical Computing and Big Data

Oregon State University

Spring 2014

Lectures: MWF 9:00-9:50, OWEN 103
Instructors:
Charlotte Wickham, 76 Kidder, charlotte.wickham@stat.oregonstate.edu
Alix Gitelman, 48 Kidder, gitelman@stat.oregonstate.edu

Office hours:
Wickham: 1-2pm WF in 76 Kidder
Gitelman: 2-3pm M in 48 Kidder

Week 1

Introduction

Reading:

dplyr vignette. Install the dplyr package in R, then open the vignette from the help system (type ??dplyr at the command line and click on dplyr::introduction). Read through the vignette and run all of the commands.

Large Datasets and You: A Field Guide

The Split-Apply-Combine Strategy for Data Analysis by H. Wickham
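
If you want something to try alongside this reading, here is a minimal sketch of the split-apply-combine pattern in base R. The built-in mtcars data and the choice of summary (mean mpg per cylinder count) are illustrative assumptions, not taken from the paper.

```r
# Split-apply-combine in base R, using the built-in mtcars data.

# Split: one data frame per number of cylinders.
pieces <- split(mtcars, mtcars$cyl)

# Apply: summarise each piece independently.
summaries <- lapply(pieces, function(piece) {
  data.frame(
    cyl = piece$cyl[1],
    n = nrow(piece),
    mean_mpg = mean(piece$mpg)
  )
})

# Combine: stack the per-group summaries into one data frame.
result <- do.call(rbind, summaries)
print(result)
```

dplyr expresses the same pattern with group_by() and summarise().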

Week 2

Reading:

Eight (No, Nine!) Problems With Big Data

Big data and big business: Should statisticians join in?

Why Big Data is Bad for Science

Is Big Data an Economic Big Dud?

Where Does a Statistician Fit in the Big Data Era?

The ASA and Big Data

Week 3

Reading:

Performance of R. At least read the sections Introduction, Why is R slow?, Microbenchmarking, and Implementation performance. There are some suggested exercises; do some!
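
As a warm-up for the exercises, here is a rough sketch of timing two equivalent implementations. It uses only base R's system.time(), which is coarser than a dedicated microbenchmarking tool; the functions sum_sq_loop and sum_sq_vec are made up for illustration.

```r
# Two ways to compute a sum of squares; both should give the same answer.
x <- runif(1e5)

sum_sq_loop <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value^2
  }
  total
}

sum_sq_vec <- function(v) sum(v^2)

# Check agreement before comparing speed.
stopifnot(isTRUE(all.equal(sum_sq_loop(x), sum_sq_vec(x))))

# Repeat each call several times so the elapsed time is measurable.
time_loop <- system.time(for (i in 1:20) sum_sq_loop(x))
time_vec <- system.time(for (i in 1:20) sum_sq_vec(x))

# Elapsed seconds for each version; exact values depend on your machine.
print(c(loop = time_loop[["elapsed"]], vectorised = time_vec[["elapsed"]]))
```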

Memory in R. At least read the sections Memory, object.size(), Total memory use, and Garbage collection. Again, try some of the exercises.
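
A small sketch to experiment with while reading, using only base R. The vector length is an arbitrary choice, and the exact sizes reported will depend on your platform.

```r
# A numeric vector uses roughly 8 bytes per element plus a small header.
x <- numeric(1e6)
size_x <- object.size(x)
print(size_x, units = "Mb")

# A list holding the same vector twice is reported at about twice the size,
# because object.size() does not detect that the two elements are shared.
pair <- list(x, x)
size_pair <- object.size(pair)
print(size_pair, units = "Mb")

# gc() triggers a garbage collection and returns a matrix summarising memory use.
rm(x, pair)
mem <- gc()
print(mem)
```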

git. If you haven't already, read at least the first tutorial linked.

(optional) Chapter 14 in The Art of R Programming by Norman Matloff. Another discussion of speed and memory in R that you might find useful.

Week 4

Presentations

Reading:

Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. The foundational paper attempting to rank perceptual tasks.

A Tour through the Visualization Zoo. Some ideas for more exotic visualizations.

Infovis and Statistical Graphics: Different Goals, Different Looks. A good discussion by Andrew Gelman and Antony Unwin of the varying goals of graphics.

Week 5

Reading:

Big Data: are we making a big mistake?

Return of Sampling Statistics

The dplyr databases vignette

How are databases efficient? Read the answers and follow a few links.

Week 6

Week 7

Presentations

  • May 12. Group 2 (Edwards M., Narendra Babu, Zhang), Group 3 (Brintz, Eng, Wang, Wei)
  • May 14. Group 6 (Guermond, Olstad, Pahukula, Skalland), Group 1 (Choi, Dong, James, Shellhammer)
  • May 16. Group 5 (Guyer, Lei, Zhuo, Zongo), Group 4 (Bernath, Edwards E., Kitada)

Week 8

Machine Learning

Reading:

Chapter 1 from Machine Learning by K. Murphy

Bias Variance tradeoff. A great tutorial on prediction error.

Measuring error. A great tutorial on measuring prediction error.

Cross validation. Nice slides illustrating cross validation.
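
To make the idea concrete, here is a minimal sketch of k-fold cross validation in base R. The data (built-in mtcars), the model (mpg ~ wt), and k = 5 are all illustrative assumptions rather than anything from the readings.

```r
set.seed(1)
k <- 5

# Randomly assign each row of mtcars to one of k folds of near-equal size.
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other folds, then measure squared error on the held-out fold.
fold_mse <- sapply(seq_len(k), function(i) {
  train <- mtcars[fold != i, ]
  test <- mtcars[fold == i, ]
  fit <- lm(mpg ~ wt, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

# The cross-validation estimate of prediction error is the average over folds.
cv_mse <- mean(fold_mse)
print(cv_mse)
```

Because each row is held out exactly once, every prediction is made by a model that never saw that row.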

  • May 19. Debrief
  • May 21. Project 3 release
  • May 23. Progress report and general pointers

Week 9

Reading:

Big Data tools. Read about the tool assigned to you and submit on Blackboard by Friday, June 6. Read about a few others too!

Week 10

Presentations