Thread starter: 论坛COO

A Collection of Overseas Statistics Q&A Posts (English)

#11 | Posted by the OP on 2013-7-8 09:17:32
What are good basic statistics to use for ordinal data?

I have some ordinal data gained from survey questions. In my case they are Likert style responses (Strongly Disagree-Disagree-Neutral-Agree-Strongly Agree). In my data they are coded as 1-5.

I don't think means would mean much here, so which basic summary statistics are considered useful?



A frequency table is a good place to start. You can report the count and relative frequency for each level. The total count and the number of missing values may also be of use.

You can also use a contingency table to compare two variables at once, and display it with a mosaic plot.
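As a rough sketch of those summaries in base R (the Likert responses and the grouping variable below are simulated purely for illustration):

lik <- factor(sample(1:5, 100, replace = TRUE), levels = 1:5,
              labels = c("Strongly Disagree", "Disagree", "Neutral",
                         "Agree", "Strongly Agree"))
table(lik)                          # count per level
round(prop.table(table(lik)), 2)    # relative frequency per level
sum(is.na(lik))                     # number of missing values
grp <- sample(c("A", "B"), 100, replace = TRUE)
tab <- table(grp, lik)              # contingency table of two variables
mosaicplot(tab, main = "Responses by group")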
#12 | Posted by the OP on 2013-7-8 09:20:12
Can someone please explain the back-propagation algorithm?

What is the back-propagation algorithm and how does it work?



[The original answer was posted as two image attachments: 38.png and 39.png.]
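Since the screenshots do not reproduce here, the following is a stand-in sketch (not the original answer): a minimal one-hidden-layer network trained by back-propagation in R. The toy task, network size, learning rate, and iteration count are all arbitrary choices.

set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- matrix(rnorm(100 * 2), 100, 2)            # inputs
y <- as.numeric(x[, 1] * x[, 2] > 0)           # target: an XOR-like pattern
W1 <- matrix(rnorm(2 * 4, sd = 0.5), 2, 4); b1 <- rep(0, 4)
W2 <- rnorm(4, sd = 0.5);                   b2 <- 0
for (step in 1:5000) {
  # forward pass: compute hidden activations and the output
  h    <- sigmoid(x %*% W1 + matrix(b1, 100, 4, byrow = TRUE))
  yhat <- sigmoid(h %*% W2 + b2)
  # backward pass: propagate the output error back layer by layer
  d2 <- (yhat - y) * yhat * (1 - yhat)         # delta at the output
  d1 <- (d2 %*% t(W2)) * h * (1 - h)           # delta at the hidden layer
  # gradient-descent updates (learning rate 0.1, averaged over the batch)
  W2 <- W2 - 0.1 * t(h) %*% d2 / 100; b2 <- b2 - 0.1 * mean(d2)
  W1 <- W1 - 0.1 * t(x) %*% d1 / 100; b1 <- b1 - 0.1 * colMeans(d1)
}
mean((yhat > 0.5) == y)                        # training accuracy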
#13 | Posted by the OP on 2013-7-8 09:29:13
PCA on correlation or covariance?

What are the main differences between performing Principal Components Analysis on a correlation and covariance matrix? Do they give the same results?



You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales. Using the correlation matrix standardises the data.

In general they give different results, especially when the scales differ.

As an example, take a look at the R heptathlon data set. Some of the variables have an average value of about 1.8 (the high jump), whereas other variables (the 200 m) are around 20.
library(HSAUR)
# look at the heptathlon data
heptathlon

# correlations
round(cor(heptathlon[, -8]), 2)   # ignoring "score"
# covariances
round(cov(heptathlon[, -8]), 2)

# PCA
# scale = TRUE bases the PCA on the correlation matrix
hep.PC.cor <- prcomp(heptathlon[, -8], scale = TRUE)
hep.PC.cov <- prcomp(heptathlon[, -8], scale = FALSE)

# PC scores per competitor
hep.scores.cor <- predict(hep.PC.cor)
hep.scores.cov <- predict(hep.PC.cov)

# Plot of PC1 vs PC2
par(mfrow = c(2, 1))
plot(hep.scores.cov[, 1], hep.scores.cov[, 2],
     xlab = "PC 1", ylab = "PC 2", pch = NA, main = "Covariance")
text(hep.scores.cov[, 1], hep.scores.cov[, 2], labels = 1:25)

plot(hep.scores.cor[, 1], hep.scores.cor[, 2],
     xlab = "PC 1", ylab = "PC 2", pch = NA, main = "Correlation")
text(hep.scores.cor[, 1], hep.scores.cor[, 2], labels = 1:25)
Notice that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used.
#14 | Posted by the OP on 2013-7-8 09:34:31
How would you explain Markov Chain Monte Carlo (MCMC) to a layperson?

Maybe the concept, why it's used, and an example.



First, we need to understand what a Markov chain is. Consider the following weather example from Wikipedia. Suppose that the weather on any given day can be classified into only two states, sunny and rainy. Based on past experience, we know the following:

Probability(Next day is sunny | Today is rainy) = 0.50

Since the next day's weather is either sunny or rainy, it follows that:

Probability(Next day is rainy | Today is rainy) = 0.50

Similarly, let:

Probability(Next day is rainy | Today is sunny) = 0.10

Therefore, it follows that:

Probability(Next day is sunny | Today is sunny) = 0.90

The above four numbers can be compactly represented as a transition matrix, which gives the probabilities of the weather moving from one state to another:

            S    R
    P = S [ 0.9  0.1 ]
        R [ 0.5  0.5 ]

We might ask several questions whose answers follow:

Q1: If the weather is sunny today then what is the weather likely to be tomorrow?

A1: Since we do not know for sure what is going to happen, the best we can say is that there is a 90% chance it will be sunny and a 10% chance it will be rainy.

Q2: What about two days from today?

A2: The one-day-ahead prediction is 90% sunny, 10% rainy. Two days from now, either:

the first day is sunny and the second day is also sunny, with probability 0.9 * 0.9;

or

the first day is rainy and the second day is sunny, with probability 0.1 * 0.5.

Therefore, the probability that the weather will be sunny in two days is:

Prob(Sunny two days from now) = 0.9 * 0.9 + 0.1 * 0.5 = 0.81 + 0.05 = 0.86

Similarly, the probability that it will be rainy is:

Prob(Rainy two days from now) = 0.1 * 0.5 + 0.9 * 0.1 = 0.05 + 0.09 = 0.14

If you keep forecasting the weather like this, you will notice that the nth-day forecast, for n large (say 30), eventually settles to the following 'equilibrium' probabilities:

Prob(Sunny) = 0.833, Prob(Rainy) = 0.167

In other words, your forecasts for the nth day and the (n+1)th day are the same. You can also check that these 'equilibrium' probabilities do not depend on today's weather: you get the same forecast whether you start off by assuming that today is sunny or rainy.

The above example only works if the state transition probabilities satisfy several conditions which I will not discuss here. But notice the following feature of such a 'nice' Markov chain (nice = the transition probabilities satisfy those conditions):

irrespective of the initial starting state, we eventually reach an equilibrium probability distribution over states.
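A quick numeric check of that claim in R: iterate the transition matrix from two different starting states and watch both converge to the same distribution.

P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE,
            dimnames = list(c("S", "R"), c("S", "R")))
p.sunny <- c(1, 0)   # today is certainly sunny
p.rainy <- c(0, 1)   # today is certainly rainy
for (day in 1:30) {
  p.sunny <- p.sunny %*% P
  p.rainy <- p.rainy %*% P
}
round(p.sunny, 3)    # 0.833 0.167
round(p.rainy, 3)    # 0.833 0.167 -- the same, regardless of the start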

Markov Chain Monte Carlo exploits this feature as follows:

We want to generate random draws from a target distribution, so we identify a way to construct a 'nice' Markov chain whose equilibrium probability distribution is our target distribution.

If we can construct such a chain, we start from an arbitrary point and iterate the chain many times (just as we forecast the weather above). Eventually, the draws we generate appear as if they are coming from the target distribution.

We then approximate the quantities of interest (e.g. the mean) by the sample average of the draws, after discarding a few initial draws; this is the Monte Carlo component.

There are several ways to construct such 'nice' Markov chains (e.g., the Gibbs sampler and the Metropolis-Hastings algorithm).
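As a flavour of the simplest of these, here is a minimal random-walk Metropolis-Hastings sketch in R; the standard normal target and the unit proposal step are arbitrary illustrative choices.

set.seed(1)
target <- function(x) dnorm(x)        # the distribution we want draws from
n.draws <- 10000
draws <- numeric(n.draws)
x <- 0                                # arbitrary starting point
for (i in 1:n.draws) {
  prop <- x + rnorm(1)                # propose a random-walk move
  if (runif(1) < target(prop) / target(x)) x <- prop  # accept, else stay put
  draws[i] <- x
}
mean(draws[-(1:1000)])                # sample average after burn-in, near 0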
#15 | Posted by the OP on 2013-7-8 10:03:39
What is the best way to identify outliers in multivariate data?

Suppose I have a large set of multivariate data with at least three variables. How can I find the outliers? Pairwise scatterplots won't work as it is possible for an outlier to exist in 3 dimensions that is not an outlier in any of the 2 dimensional subspaces.

I am not thinking of a regression problem, but of true multivariate data. So answers involving robust regression or computing leverage are not helpful.

One possibility would be to compute the principal component scores and look for an outlier in the bivariate scatterplot of the first two scores. Would that be guaranteed to work? Are there better approaches?



I think Robin Girard's answer would work pretty well for 3 and possibly 4 dimensions, but the curse of dimensionality would prevent it working beyond that. However, his suggestion led me to a related approach which is to apply the cross-validated kernel density estimate to the first three principal component scores. Then a very high-dimensional data set can still be handled ok.

In summary, for i = 1 to n:

1. Compute a density estimate of the first three principal component scores obtained from the data set without Xi.
2. Calculate the likelihood of Xi under the density estimated in step 1; call it Li.

Sort the Li (for i = 1, ..., n); the outliers are the points with likelihood below some threshold. I'm not sure what a good threshold would be -- I'll leave that for whoever writes the paper on this! One possibility is to do a boxplot of the log(Li) values and see which outliers are detected at the negative end.
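A rough R sketch of this recipe, assuming the ks package for the three-dimensional kernel density estimate (any multivariate KDE would do); the simulated data, the function name, and the boxplot-based threshold are all illustrative.

library(ks)
outlier.likelihoods <- function(X) {
  pc <- prcomp(X, scale = TRUE)$x[, 1:3]    # first three PC scores
  n  <- nrow(pc)
  L  <- numeric(n)
  for (i in 1:n) {
    fit  <- kde(pc[-i, ])                   # density estimated without point i
    L[i] <- predict(fit, x = pc[i, , drop = FALSE])  # likelihood of point i
  }
  L
}
X <- rbind(matrix(rnorm(500), ncol = 5), rep(5, 5))  # 100 points plus 1 planted outlier
L <- outlier.likelihoods(X)
boxplot(log(L))                             # look for outliers at the negative end
which(log(L) < boxplot.stats(log(L))$stats[1])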
#16 | Posted by the OP on 2013-7-26 16:29:52
How do I order or rank a set of experts?

I have a database containing a large number of experts in a field. For each of those experts I have a variety of attributes/data points, like:

    number of years of experience
    licenses
    number of reviews
    textual content of those reviews
    the 5-star rating on each of those reviews, for a number of factors like speed, quality, etc.
    awards, associations, conferences, etc.

I want to provide a rating for these experts, say out of 10, based on their importance. Some of the data points might be missing for some of the experts. Now my question is: how do I come up with such an algorithm? Can anyone point me to some relevant literature?

I am also concerned that, as with all ratings/reviews, the numbers might bunch up near certain values. For example, most experts might end up getting an 8 or a 5. Is there a way to amplify small differences into larger differences in the score for only some of the attributes?

Some other discussions that I figured might be relevant:
http://stats.stackexchange.com/questions/1848/bayesian-rating-system-with-multiple-categories-for-each-rating

http://stats.stackexchange.com/questions/2689/how-would-you-compute-imdb-movie-rating
http://stats.stackexchange.com/questions/1/eliciting-priors-from-experts
http://stats.stackexchange.com/questions/2563/what-are-some-of-the-best-ranking-algorithms-with-inputs-as-up-and-down-votes




People have invented numerous systems for rating things (like experts) on multiple criteria: visit the Wikipedia page on Multi-criteria decision analysis for a list. Not well represented there, though, is one of the most defensible methods out there: multi-attribute valuation theory. This comprises a set of methods to evaluate trade-offs among sets of criteria in order to (a) determine an appropriate way to re-express values of the individual variables and (b) weight the re-expressed values to obtain a score for ranking. The principles are simple and defensible, the mathematics is unimpeachable, and there's nothing fancy about the theory. More people should know and practice these methods rather than inventing arbitrary scoring systems.
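To make those two steps concrete, here is a toy sketch of an additive multi-attribute score in R. The attributes, weights, rescaling, and missing-data handling are all invented for illustration; in real multi-attribute valuation the value functions and weights are elicited from trade-off judgments, not fixed ad hoc.

experts <- data.frame(
  years   = c(12, 3, 25, 8),
  reviews = c(40, 5, 200, NA),    # missing data points are allowed
  stars   = c(4.2, 3.8, 4.9, 4.5)
)
# (a) re-express each attribute on a common 0-1 value scale
rescale <- function(x) (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
V <- as.data.frame(lapply(experts, rescale))
# (b) weight the re-expressed values, renormalising over observed attributes
w <- c(years = 0.3, reviews = 0.3, stars = 0.4)
score <- apply(V, 1, function(v) {
  ok <- !is.na(v)
  10 * sum(w[ok] * v[ok]) / sum(w[ok])   # score out of 10
})
rank(-score)                             # 1 = highest-rated expert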
#17 | Posted by the OP on 2013-7-26 16:32:38
Is adjusting p-values in a multiple regression for multiple comparisons a good idea?

Let's assume you are a social science researcher/econometrician trying to find relevant predictors of demand for a service. You have 2 outcome/dependent variables describing the demand (using the service yes/no, and the number of occasions). You have 10 predictor/independent variables that could theoretically explain the demand (e.g., age, sex, income, price, race, etc.). Running two separate multiple regressions yields 20 coefficient estimates and their p-values. With enough independent variables in your regressions, you will sooner or later find at least one variable with a statistically significant association between the dependent and independent variables.

My question: is it a good idea to correct the p-values for multiple tests if I want to include all the independent variables in the regression? Any references to prior work are much appreciated.



It seems your question more generally addresses the problem of identifying good predictors. In that case, you should consider using some kind of penalized regression (methods dealing with variable or feature selection are relevant too), with e.g. L1 or L2 penalties, or a combination thereof, the so-called elastic net (look for related questions on this site, or the R penalized and elasticnet packages, among others).
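For the penalized route, a minimal sketch on simulated data, using the glmnet package here purely for brevity (the penalized and elasticnet packages mentioned above serve the same purpose):

set.seed(1)
library(glmnet)
n <- 200
x <- matrix(rnorm(n * 10), n, 10)      # 10 hypothetical predictors
y <- 2 * x[, 1] - 1.5 * x[, 4] + rnorm(n)
cvfit <- cv.glmnet(x, y, alpha = 0.5)  # alpha = 0.5 mixes L1 and L2 (elastic net)
coef(cvfit, s = "lambda.1se")          # irrelevant predictors are shrunk to zero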

Now, about correcting p-values for your regression coefficients (or, equivalently, your partial correlation coefficients) to protect against over-optimism (e.g., with Bonferroni or, better, step-down methods): this seems relevant only if you are considering one model and seek the predictors that contribute a significant part of the explained variance, that is, if you don't perform model selection (with stepwise selection or hierarchical testing). This article may be a good start: Bonferroni Adjustments in Tests for Regression Coefficients. Be aware that such a correction won't protect you against multicollinearity, which affects the reported p-values.

Given your data, I would recommend using some kind of iterative model selection technique. In R, for instance, the stepAIC function allows you to perform stepwise model selection by exact AIC. You can also estimate the relative importance of your predictors based on their contribution to R2 using the bootstrap (see the relaimpo package). I think that reporting an effect size measure or the % of explained variance is more informative than a p-value, especially in a confirmatory model.
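A small sketch of both suggestions on simulated data (the data-generating process and variable names are invented):

set.seed(1)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(X) <- paste0("x", 1:10)
X$y <- 2 * X$x1 - 1.5 * X$x4 + rnorm(n)   # only x1 and x4 actually matter

library(MASS)                          # provides stepAIC
full <- lm(y ~ ., data = X)
step <- stepAIC(full, trace = FALSE)   # stepwise model selection by AIC
summary(step)

library(relaimpo)                      # relative importance with bootstrap CIs
booteval.relimp(boot.relimp(full, b = 200), sort = TRUE)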

It should be noted that stepwise approaches also have their drawbacks (e.g., Wald tests are not adapted to the conditional hypotheses induced by the stepwise procedure). As Frank Harrell put it on the R mailing list, "stepwise variable selection based on AIC has all the problems of stepwise variable selection based on P-values. AIC is just a restatement of the P-value" (though AIC remains useful if the set of predictors is already defined). A related question -- Is a variable significant in a linear regression model? -- raised interesting comments (from @Rob, among others) about the use of AIC for variable selection. I append a couple of references at the end (including papers kindly provided by @Stephan); there are also many other references on P.Mean.

Frank Harrell authored a book, Regression Modeling Strategies, which includes a lot of discussion and advice on this problem (§4.3, pp. 56-60). He also developed efficient R routines for dealing with generalized linear models (see the Design and rms packages). So I think you definitely have to take a look at it (his handouts are available on his homepage).
#18 | Posted by the OP on 2013-7-26 16:36:30
How can I test the fairness of a d20?
How can I test the fairness of a twenty-sided die (d20)? Obviously I would be comparing the distribution of values against a uniform distribution. I vaguely remember using a chi-square test in college. How can I apply this to see whether a die is fair?



Here's an example with R code. The output is preceded by #'s. A fair die:
rolls <- sample(1:20, 200, replace = TRUE)
table(rolls)
#rolls
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 7  8 11  9 12 14  9 14 11  7 11 10 13  8  8  5 13  9 10 11
chisq.test(table(rolls), p = rep(0.05, 20))

#         Chi-squared test for given probabilities
#
# data:  table(rolls)
# X-squared = 11.6, df = 19, p-value = 0.902
A biased die - numbers 1 to 10 each have probability 0.045, and numbers 11 to 20 each have probability 0.055 - with 200 throws:
rolls <- sample(1:20, 200, replace = TRUE,
                prob = c(rep(0.045, 10), rep(0.055, 10)))
table(rolls)
#rolls
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 8  9  7 12  9  7 14  5 10 12 11 13 14 16  6 10 10  7  9 11
chisq.test(table(rolls), p = rep(0.05, 20))

#        Chi-squared test for given probabilities
#
# data:  table(rolls)
# X-squared = 16.2, df = 19, p-value = 0.6439
We have insufficient evidence of bias (p = 0.64).

A biased die, 1000 throws:
rolls <- sample(1:20, 1000, replace = TRUE,
                prob = c(rep(0.045, 10), rep(0.055, 10)))
table(rolls)
#rolls
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 42 47 34 42 47 45 48 43 42 45 52 50 57 57 60 68 49 67 42 63
chisq.test(table(rolls), p = rep(0.05, 20))

#        Chi-squared test for given probabilities
#
# data:  table(rolls)
# X-squared = 32.36, df = 19, p-value = 0.02846
Now p < 0.05 and we are starting to see evidence of bias. You can use similar simulations to estimate the level of bias you can expect to detect, and the number of throws needed to detect it at a given significance level.
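Along those lines, a quick power simulation for the bias pattern above; the replicate count is arbitrary, and factor() guards against throw counts small enough to leave some faces unobserved.

power.sim <- function(n.throws, n.sim = 1000) {
  p.biased <- c(rep(0.045, 10), rep(0.055, 10))
  rejected <- replicate(n.sim, {
    rolls <- sample(1:20, n.throws, replace = TRUE, prob = p.biased)
    chisq.test(table(factor(rolls, levels = 1:20)), p = rep(0.05, 20))$p.value < 0.05
  })
  mean(rejected)   # proportion of simulations that detect the bias
}
power.sim(200)     # low power with 200 throws
power.sim(1000)    # considerably higher with 1000 throws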

Wow, 2 other answers even before I finished typing.
#19 | Posted by the OP on 2013-7-26 16:40:16
Dealing with missing data due to variable not being measured over initial period of a study


I was recently consulting for a researcher in the following situation.

Context:

    data were collected over four years at around 50 participants per year (participants had a specific diagnosed clinical psychology disorder and were difficult to obtain in large numbers); participants were measured only once (i.e., it is not a longitudinal study)
    all participants had the same disorder
    the study involved participants completing a set of 10 psychological scales
    the 10 scales measured various things like symptoms, theorised precursors, and related psychopathology; the measures tended to intercorrelate around r = .3 to .7
    in the first year one of the scales was not included
    the researcher wanted to run structural equation modelling on all 10 scales on the entire sample; thus, around a quarter of the sample had missing data on one scale

The researcher wanted to know:

    What is a good strategy for dealing with missing data like this? What tips, references to applied examples, or references to advice regarding best practice would you suggest?

I had a few thoughts, but I was keen to hear your suggestions.



I like Manski's partial identification approach to missing data. The basic idea is to ask: given all the possible values the missing data could take, what is the set of values the estimated parameters could take? This set might be very large, in which case you could consider restricting the distribution of the missing data. Manski has a number of papers and a book on this topic. This short paper is a good overview.
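As a toy illustration of the idea (not Manski's estimators themselves): for an outcome bounded on a known scale, here rescaled to [0, 1], worst-case bounds on the mean come from imputing all missing values at the two extremes.

x <- c(0.2, 0.8, NA, 0.5, NA, 0.9)    # hypothetical bounded scores, 2 missing
lo <- mean(ifelse(is.na(x), 0, x))    # every missing value at the minimum
hi <- mean(ifelse(is.na(x), 1, x))    # every missing value at the maximum
c(lower = lo, upper = hi)             # the identified set for the mean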

Inference in partially identified models can be complicated and is an active area of research. This review (ungated pdf) is a good place to get started.

#20 | Posted by the OP on 2013-7-26 16:43:17
A good way to show lots of data graphically


I'm working on a project that involves 14 variables and 345,000 observations for housing data (things like year built, square footage, price sold, county of residence, etc). I'm concerned with trying to find good graphical techniques and R libraries that contain nice plotting techniques.

I'm already exploring what in ggplot and lattice will work nicely, and I'm thinking of violin plots for some of my numerical variables.

What other packages would people recommend for displaying a large number of numerical or factor-type variables in a clear, polished, and, most importantly, succinct manner?



The best "graph" is so obvious nobody has mentioned it yet: make maps. Housing data depend fundamentally on spatial location (according to the old saw about real estate), so the very first thing to be done is to make a clear detailed map of each variable. To do this well with a third of a million points really requires an industrial-strength GIS, which can make short work of the process. After that it makes sense to go on and make probability plots and boxplots to explore univariate distributions, and to plot scatterplot matrices and wandering schematic boxplots, etc, to explore dependencies--but the maps will immediately suggest what to explore, how to model the data relationships, and how to break up the data geographically into meaningful subsets.