|
PCA on correlation or covariance?
What are the main differences between performing Principal Components Analysis on a correlation and covariance matrix? Do they give the same results?
You tend to use the covariance matrix when the variables are on similar scales and the correlation matrix when they are on different scales. Using the correlation matrix is equivalent to standardising each variable (mean 0, standard deviation 1) before the analysis.
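A quick way to check the standardisation point in R. This is only a sketch, and it assumes the built-in USArrests data (not part of the heptathlon example) purely because it ships with base R:

```r
# USArrests comes with base R (datasets package); used here only for illustration.
X <- USArrests

# The covariance matrix of the standardised variables is the correlation matrix.
Z <- scale(X)  # centre each column and divide by its standard deviation
same_matrix <- isTRUE(all.equal(cov(Z), cor(X)))

# So the component variances from prcomp(..., scale. = TRUE) are the
# eigenvalues of the correlation matrix.
pc_cor <- prcomp(X, scale. = TRUE)
same_variances <- isTRUE(all.equal(pc_cor$sdev^2, eigen(cor(X))$values))

print(c(same_matrix = same_matrix, same_variances = same_variances))
```

Both checks should come back TRUE, which is why "PCA on the correlation matrix" and "PCA on standardised data" are used interchangeably.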
In general they give different results, especially when the scales are different.
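To see how much the two can differ, here is a small sketch, again assuming the built-in USArrests data as a stand-in (one of its variables, Assault, has a far larger variance than the others):

```r
X <- USArrests  # built-in data, used only for illustration

pc_cov <- prcomp(X, scale. = FALSE)  # based on the covariance matrix
pc_cor <- prcomp(X, scale. = TRUE)   # based on the correlation matrix

# Variance of each variable on its original scale
print(round(sapply(X, var), 1))

# Loadings of the first component under each approach
print(round(cbind(covariance  = pc_cov$rotation[, 1],
                  correlation = pc_cor$rotation[, 1]), 2))

# Share of the total variance captured by the first component
prop_cov <- pc_cov$sdev[1]^2 / sum(pc_cov$sdev^2)
prop_cor <- pc_cor$sdev[1]^2 / sum(pc_cor$sdev^2)
print(round(c(covariance = prop_cov, correlation = prop_cor), 2))
```

With the covariance matrix the first component is essentially just the high-variance variable; with the correlation matrix the loadings are spread much more evenly across the variables.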
As an example, take a look at the heptathlon data set from the R package HSAUR. Some of the variables have an average value of about 1.8 (the high jump), whereas other variables (200 m) are around 20.

    library(HSAUR)  # provides the heptathlon data set

    # Look at the heptathlon data
    heptathlon

    # Correlations, ignoring "score" (column 8)
    round(cor(heptathlon[, -8]), 2)

    # Covariances
    round(cov(heptathlon[, -8]), 2)

    # PCA: scale. = TRUE bases the PCA on the correlation matrix,
    # scale. = FALSE bases it on the covariance matrix
    hep.PC.cor <- prcomp(heptathlon[, -8], scale. = TRUE)
    hep.PC.cov <- prcomp(heptathlon[, -8], scale. = FALSE)

    # PC scores per competitor
    hep.scores.cor <- predict(hep.PC.cor)
    hep.scores.cov <- predict(hep.PC.cov)

    # Plot PC1 vs PC2, labelling each competitor by row number
    par(mfrow = c(2, 1))
    plot(hep.scores.cov[, 1], hep.scores.cov[, 2],
         xlab = "PC 1", ylab = "PC 2", pch = NA, main = "Covariance")
    text(hep.scores.cov[, 1], hep.scores.cov[, 2], labels = 1:25)
    plot(hep.scores.cor[, 1], hep.scores.cor[, 2],
         xlab = "PC 1", ylab = "PC 2", pch = NA, main = "Correlation")
    text(hep.scores.cor[, 1], hep.scores.cor[, 2], labels = 1:25)

Notice that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used. |
|