March 31, 2014
The four V's: variety, velocity, volume and value (veracity?)
It's big when your usual tools don't work
"When computational time exceeds cognitive time"
A marketing phrase
(More statistical) It's big when your usual inferential approaches don't work
There is always something bigger! We'll concentrate on general strategies, but to put them into practice we'll have to learn some specific tools.
As data gets bigger, basic descriptive and/or exploratory analysis becomes hard.
But often answering our questions doesn't require big data.
Making big data small: subset, summarise, sample
We'll also learn some new tools: dplyr, git, SQL
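As a rough dplyr sketch of those three strategies (the data frame and column names below are simulated stand-ins, not course data):

    library(dplyr)

    # Simulated stand-in for a dataset too big to eyeball directly
    big <- data.frame(
      store  = sample(letters[1:10], 1e6, replace = TRUE),
      month  = sample(1:12, 1e6, replace = TRUE),
      amount = rexp(1e6, rate = 1 / 50)
    )

    # Subset: keep only the rows relevant to the question
    march <- filter(big, month == 3)

    # Summarise: collapse to one row per group
    by_store <- summarise(group_by(big, store),
                          mean_amount = mean(amount),
                          n = n())

    # Sample: explore a random 1% before committing to the full data
    peek <- sample_frac(big, 0.01)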
What happens when we apply our usual statistical tools (t-tests, regression, etc.) to big data?
Does it even make sense to apply our usual statistical tools to big data? Do we care about p-values?
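One way to see the p-value issue, in a simulated example (the 0.005-sd difference is made up to be practically negligible): once n is large enough, even a trivial effect tends to come out "significant".

    set.seed(1)
    n <- 1e6

    # Two groups whose true means differ by a practically negligible amount
    x <- rnorm(n, mean = 0)
    y <- rnorm(n, mean = 0.005)

    t.test(x, y)
    # With n this large, the tiny difference will typically be flagged as
    # "statistically significant"; the p-value says nothing about whether
    # a 0.005-sd difference matters in practice.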
Bigger data isn't necessarily better data
Some things don't change: What's the sample? What's the population?
You might have to think about the implementation of your usual tools
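For instance, even a quantity as simple as a mean may need to be computed one chunk at a time when the data won't fit in memory; a minimal sketch, with the chunks simulated here rather than read from a file or database:

    # Pretend the data arrive in pieces; keep running totals instead of
    # holding everything in memory at once.
    set.seed(2)
    chunks <- replicate(100, rnorm(1e4, mean = 5), simplify = FALSE)

    total <- 0
    count <- 0
    for (chunk in chunks) {
      total <- total + sum(chunk)
      count <- count + length(chunk)
    }
    total / count  # matches mean() on the pooled data, computed chunk by chunk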
A brief survey of some techniques popular in the data mining and machine learning arenas
What do they do? When are they appropriate? What don't they do? How do they scale?
Classification and Regression trees, Random Forests, clustering methods, …
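As an illustrative sketch only (iris is just a convenient placeholder dataset, not a course example), here is what fitting a tree, a random forest, and a k-means clustering looks like in R:

    # install.packages(c("rpart", "randomForest"))  # if not already installed
    library(rpart)
    library(randomForest)

    # A single classification tree (CART)
    tree_fit <- rpart(Species ~ ., data = iris)

    # A random forest: many trees grown on bootstrap samples, predictions combined
    set.seed(3)
    rf_fit <- randomForest(Species ~ ., data = iris, ntree = 200)
    print(rf_fit)  # includes an out-of-bag error estimate

    # k-means as one simple clustering method
    km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
    table(km$cluster, iris$Species)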
You learn to think about the statistical issues when using "Big" data
You learn some new tools to cope with realistic data
You get some experience doing data analysis in a team, including communicating your results