|
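Head First Data Analysis (Chinese title: 深入浅出数据分析) is a very accessible and readable book; even the all-English edition is not hard going. I spent about three weeks reading it. Probably because I had already studied everything it covers, it was more of a review, so I got up to speed quickly and came away with a rough idea of how each method should be applied. Below are some notes I took while reading.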
Chapter 1. Introduction to data analysis
The basic process of data analysis:
Define → Disassemble → Evaluate → Decide
■Define: find the general problem, understand the goal better;
■Disassemble: cut the problem into small pieces, find strong comparisons to isolate the most important elements;
■Evaluate: the key is comparison, make your own assumptions explicitly;
■Decide: compare your customer's belief to your interpretation of the data and recommend a decision.
Chapter 2. Experiments-Test your theories
The more comparative the analysis is, the better.
Observational study: A study where the people being described decide on their own which groups they belong to.
An experiment with the strategies is needed in order to know which one is best.
Control group: A group of treatment subjects that represent the status quo, not receiving any new treatment.
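A minimal R sketch of random assignment, which is what separates an experiment from an observational study: chance, not the subjects, decides who lands in the control group. The subject IDs, group sizes and seed are hypothetical.

```r
# Randomly assign 20 hypothetical subjects to a control or treatment group.
set.seed(42)                      # fixed seed so the assignment is reproducible
subjects <- paste0("S", 1:20)     # made-up subject IDs

# Shuffle ten "control" and ten "treatment" labels across the subjects.
group <- sample(rep(c("control", "treatment"), each = 10))
assignment <- data.frame(subject = subjects, group = group)

table(assignment$group)           # counts per group
```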
Chapter 3. Optimization-Take it to the max
Optimization problem: to get as much of something as possible by changing the values of other quantities
Optimization problem: Objective function; constraints
Feasible region
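To make the objective function, constraints and feasible region concrete, here is a small brute-force sketch in base R. The products, profits and limits are hypothetical numbers chosen for illustration, and enumerating a grid is only a stand-in for a real solver.

```r
# Hypothetical problem: choose how many ducks and fish to produce.
# Objective function: profit = 5 * ducks + 4 * fish (to be maximized).
# Constraints: at most 400 ducks, at most 300 fish, and a shared rubber
# supply of 50000 units (100 per duck, 125 per fish).
grid <- expand.grid(ducks = 0:400, fish = 0:300)

# The feasible region is every combination that satisfies all constraints.
feasible <- grid[100 * grid$ducks + 125 * grid$fish <= 50000, ]

# Evaluate the objective on the feasible region and keep the best point.
feasible$profit <- 5 * feasible$ducks + 4 * feasible$fish
best <- feasible[which.max(feasible$profit), ]
best
```

Changing a constraint, for example the rubber supply, and rerunning shows how the assumptions built into a model drive its answer.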
All models are wrong, but some are useful.
Calibrate the assumptions to your analytical objectives.
Any time you create a model, make sure you specify the assumptions about how the variables relate to each other.
Chapter 4. Data visualization-Pictures make you smarter
Data visualization: shows the data, makes a smart comparison, shows multiple variables
Showing more variables by looking at charts together.
Chapter 5. Hypothesis testing-Say it ain't so
Falsification(证伪) is the heart of hypothesis testing.
Diagnosticity(可诊断性) is the ability of evidence to help you assess the relative likelihood of the hypotheses you are considering. If evidence is diagnostic, it helps you rank your hypotheses.
Chapter 6. Bayesian statistics-Get past the first base
Conditional probability: the probability of some event given that another event has happened.
P(+|L)=1-P(-|L);P(+|~L)=1-P(-|~L)[the tilde symbol means the statement L is not true]
prior probability: base rate
P(A|B)=P(B|A)*P(A)/P(B);P(B)=P(B|A)*P(A)+P(B|~A)*P(~A)
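The formula above can be sketched as a small R function. The function name and the input numbers (a 1% base rate, a test that is positive for 90% of true cases and for 9% of non-cases) are illustrative assumptions.

```r
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), where
# P(B) = P(B|A) * P(A) + P(B|~A) * P(~A).
posterior <- function(prior, p_b_given_a, p_b_given_not_a) {
  p_b <- p_b_given_a * prior + p_b_given_not_a * (1 - prior)
  p_b_given_a * prior / p_b
}

# Probability of having the condition after one positive test.
after_one <- posterior(prior = 0.01, p_b_given_a = 0.90, p_b_given_not_a = 0.09)

# A second positive test: the first posterior becomes the new prior.
after_two <- posterior(prior = after_one, p_b_given_a = 0.90, p_b_given_not_a = 0.09)

c(after_one = after_one, after_two = after_two)
```

With a low base rate, the probability after one positive test stays well below the test's 90% hit rate, which is why the prior cannot be ignored.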
Chapter 7. Subjective probabilities-Numerical belief
Subjective probabilities are a great way to apply discipline to an analysis, especially when you are predicting single events that lack hard data to describe what happened previously under identical conditions.
Standard deviation: measures how far typical points are from the average of the data set.
Standard deviation in Excel: STDEV
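For comparison, a short R sketch of the same calculation. Both Excel's STDEV and R's sd() use the sample formula that divides by n - 1; the numbers are made up.

```r
# Hypothetical data set.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample standard deviation by hand: root of the summed squared
# deviations from the mean, divided by n - 1.
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Built-in equivalent.
sd_builtin <- sd(x)

c(manual = sd_manual, builtin = sd_builtin)
```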
Bayes' rule is great for revising subjective probabilities.
Chapter 8. Heuristics-Analyze like a human
Heuristic(启发式) is seeing a few options while intuition is seeing one option and optimization is seeing all the options.
Heuristic: (Psychological definition) replacing a difficult or confusing attribute with a more accessible one; (Computer science definition) a way of solving a problem that will tend to give you accurate answers but that does not guarantee optimality.
Data analysis is all about breaking down problems into manageable pieces and fitting mental and statistical models to data to make better judgements.
Fast and frugal tree is a schematic way of describing a heuristic.
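A fast and frugal tree can be written as a chain of checks where each cue offers an immediate exit. The sketch below is a hypothetical support-ticket example; the cues, their order and the threshold are all assumptions made for illustration.

```r
# Each cue is checked in order; the first one that applies decides the outcome.
triage_ticket <- function(is_outage, is_paying_customer, open_days) {
  if (is_outage) return("escalate")            # cue 1: exit right away
  if (!is_paying_customer) return("queue")     # cue 2: exit right away
  if (open_days > 3) "escalate" else "queue"   # last cue: two exits
}

triage_ticket(is_outage = FALSE, is_paying_customer = TRUE, open_days = 5)
```

Only a few cues are consulted and most cases exit early, which is what makes the heuristic fast and frugal rather than optimal.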
Chapter 9. Histograms-The shape of numbers
When you get a huge set of data, use histograms to help you understand them better.
Histogram in Excel: Data → Data Analysis → Histogram [shows frequencies of groups of numbers, the distribution of data points across their range of values]
A better way is to use R.
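R: hist(data, breaks=); breaks sets the number of bars (bins) in the histogram. When breaks is a single number, R treats it as a suggestion and may pick a nearby number of bins with rounder boundaries.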
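A small sketch of hist() on simulated data, using plot = FALSE so the bins can be inspected as numbers. The data set is randomly generated and purely illustrative.

```r
# Simulated scores standing in for a real data set.
set.seed(1)
scores <- rnorm(200, mean = 50, sd = 10)

# Ask for about 10 bins; plot = FALSE returns the histogram object
# without drawing it.
h <- hist(scores, breaks = 10, plot = FALSE)

h$breaks            # bin boundaries actually chosen by R
h$counts            # number of data points in each bin
length(h$counts)    # actual number of bins, which may differ from 10
```

Dropping plot = FALSE draws the chart; trying several values of breaks is a quick way to see whether the shape of the distribution is stable.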
Chapter 10. Regression- Prediction
Algorithm: a procedure you follow to complete a calculation.
Scatterplot → Fitted line → Correlation coefficient → Linear equation →Prediction
When the two variables are in pairs that describe the same underlying thing or person, you can use the scatterplot.
The regression line is just the line that best fits the points on the graph of averages.
Correlation coefficient (r): A correlation is a linear association between two variables; r measures its strength and direction on a scale from -1 to 1.
As long as you can see a solid association between your two variables, and as long as your regression makes sense, you can trust your software to deal with the coefficients.
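The chain from correlation coefficient to linear equation to prediction can be sketched in base R as follows. The paired values are invented, and the column names x and y are placeholders.

```r
# Hypothetical paired observations.
mydata <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
)

# Correlation coefficient: strength of the linear association.
r <- cor(mydata$x, mydata$y)

# Fitted line: lm() estimates the intercept and slope.
fit <- lm(y ~ x, data = mydata)
coef(fit)

# Prediction for a new x inside the observed range.
pred <- predict(fit, newdata = data.frame(x = 4.5))
pred
```

plot(y ~ x, data = mydata) followed by abline(fit) draws the scatterplot with the fitted line on top.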
Chapter 11. Error -Err well
Error range for your prediction.
Extrapolation: using a regression equation to predict a value outside your range.
For a prediction, you should make the valid data range clear to the client to avoid extrapolation, or state the assumptions you are making when you go beyond it.
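One way to keep extrapolation visible is to check each new value against the observed range before predicting. The helper below is a hypothetical sketch, not a standard R function, and the data are made up.

```r
# Hypothetical data and fitted line.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
fit <- lm(y ~ x)

# Predict, but warn when x_new lies outside the range the model was fitted on.
predict_in_range <- function(fit, x_new, x_observed) {
  rng <- range(x_observed)
  if (x_new < rng[1] || x_new > rng[2]) {
    warning("x_new is outside the observed range; this is extrapolation")
  }
  unname(predict(fit, newdata = data.frame(x = x_new)))
}

predict_in_range(fit, 4.5, x)   # inside the observed range, no warning
```

Calling predict_in_range(fit, 20, x) still returns a number, but it comes with a warning that the regression is being used outside the data it was built from.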
Chance errors (residuals) are deviations from what your model predicts.
Specify error quantitatively, explain how far away from the prediction typical outcomes will be.
Residual distribution: the spread of the chance error.
Quantify the residual distribution with Root Mean Squared error ( sigma or residual standard error).
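A short sketch of how the residual standard error falls out of a fitted model in R. The data are hypothetical. Note that R divides the squared residuals by the residual degrees of freedom (n - 2 for a line), not by n.

```r
# Hypothetical data and fitted line.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
fit <- lm(y ~ x)

# Chance errors: observed values minus what the model predicts.
res <- residuals(fit)

# Root mean squared error computed by hand.
sigma_manual <- sqrt(sum(res^2) / df.residual(fit))

# The same quantity as reported by summary().
sigma_r <- summary(fit)$sigma

c(manual = sigma_manual, from_summary = sigma_r)
```

Reporting a prediction as "predicted value plus or minus sigma" tells the client how far typical outcomes are likely to fall from the line.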
Segmentation is splitting data into groups. It will help you manage error by providing more sensible statistics to describe what happens in each region.
Good regressions balance explanation and prediction.
Chapter 12. Relational databases - Can you relate?
A database is a collection of data with well-specified relations to each other.
plot(sales~jitter(article.count),data=mydata); jitter() adds a small amount of random noise to the data so that overlapping points show up more clearly in the scatterplot (compare with the plot without jitter() to see the difference).
The jitter command adds a little bit of noise to your numbers, which separates them a little and makes them easier to see on the scatterplot.
RDBMS: relational database management system.
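The idea of tables linked by well-specified relations can be sketched in base R with merge(), which joins two data frames on a shared key. The tables, column names and values here are hypothetical.

```r
# Two hypothetical tables related through issueID.
issues <- data.frame(issueID = c(1, 2, 3), sales = c(1200, 1500, 900))
articles <- data.frame(
  articleID  = 1:5,
  issueID    = c(1, 1, 2, 3, 3),
  authorName = c("Ann", "Bob", "Ann", "Cy", "Bob")
)

# Count the articles in each issue.
counts <- aggregate(articleID ~ issueID, data = articles, FUN = length)
names(counts)[2] <- "article.count"

# Join the counts to the sales figures using the shared key.
mydata <- merge(issues, counts, by = "issueID")
mydata
```

The result has one row per issue with both sales and article.count, which is the shape a scatterplot of sales against article count needs.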
library(lattice)
xyplot(webHits~commentCount|authorName,data=mydata)
Chapter 13. Cleaning data - Impose order
Cleaning messy data is all about preparation.
Excel formula:
find: tell you where to find a search string within a cell
left: grab characters on the left side of a cell
right: grab characters on the right side of a cell
trim: remove excess blank spaces from a cell
len: tell you the length of a cell
concatenate: take two values and stick them together
value: return a numerical value for a number stored as text
substitute: replace text you do not want in a cell with new text you specify
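For readers who prefer R, rough base R counterparts of the Excel functions above are sketched below. The sample cell value is made up, and the mapping is approximate: for instance, Excel's TRIM also collapses repeated spaces inside the text, while trimws() only strips the ends.

```r
cell <- "  Smith (ID 42)  "                            # hypothetical messy cell

trimmed    <- trimws(cell)                             # TRIM (ends only)
n_chars    <- nchar(trimmed)                           # LEN
pos        <- regexpr("(", trimmed, fixed = TRUE)      # FIND: position of "("
left_part  <- substr(trimmed, 1, 5)                    # LEFT(cell, 5)
right_part <- substr(trimmed, n_chars - 2, n_chars)    # RIGHT(cell, 3)
joined     <- paste0(left_part, "-", "42")             # CONCATENATE
number     <- as.numeric("42")                         # VALUE
replaced   <- gsub("ID ", "#", trimmed, fixed = TRUE)  # SUBSTITUTE

replaced
```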
Regular expressions are the ultimate tool for cleaning up messy data.
NewLastName=sub("\\(.*\\)","",mydata$LastName)
Regular expression: \\(.*\\) ; \\( tells R that the parenthesis is a literal character rather than part of the regular-expression syntax; . means any character; * means zero or more of the preceding character; .* therefore matches everything between the parentheses.
Remove duplicates in R: unique(mydata)
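Putting the regular expression and unique() together on a tiny made-up data frame (the names and phone numbers are invented):

```r
# Hypothetical messy data: one name carries a parenthesized note,
# and one row is a duplicate.
mydata <- data.frame(
  LastName = c("Smith (ID 42)", "Jones", "Smith (ID 42)"),
  Phone    = c("555-0100", "555-0101", "555-0100"),
  stringsAsFactors = FALSE
)

# Remove the parenthesized text, then trim the space it leaves behind.
mydata$LastName <- trimws(sub("\\(.*\\)", "", mydata$LastName))

# Drop rows that are now exact duplicates.
clean <- unique(mydata)
clean
```

Cleaning before de-duplicating matters: rows that differ only in stray text become identical once the text is removed.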
Steps of cleaning data: save a copy of original data → previsualize the final data set → identify repetitive patterns in the data → clean and restructure → use your finished data |
|