Thread starter: 论坛COO

A Collection of Overseas Statistics Q&A Posts (English)

#11 | Posted by the OP on 2013-7-8 09:17:32
What are good basic statistics to use for ordinal data?

I have some ordinal data gained from survey questions. In my case they are Likert style responses (Strongly Disagree-Disagree-Neutral-Agree-Strongly Agree). In my data they are coded as 1-5.

I don't think means would mean much here, so which basic summary statistics are considered useful?



A frequency table is a good place to start. You can report the count and relative frequency for each level. The total count and the number of missing values may also be of use.

You can also use a contingency table to compare two variables at once, and display it with a mosaic plot.
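As a rough sketch of those summaries in base R (the Likert responses and the grouping variable below are simulated purely for illustration):

lik <- factor(sample(1:5, 100, replace = TRUE), levels = 1:5,
              labels = c("Strongly Disagree", "Disagree", "Neutral",
                         "Agree", "Strongly Agree"))
table(lik)                          # count per level
round(prop.table(table(lik)), 2)    # relative frequency per level
sum(is.na(lik))                     # number of missing values
grp <- sample(c("A", "B"), 100, replace = TRUE)
tab <- table(grp, lik)              # contingency table of two variables
mosaicplot(tab, main = "Responses by group")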
#12 | Posted by the OP on 2013-7-8 09:20:12
Can someone please explain the back-propagation algorithm?

What is the back-propagation algorithm and how does it work?



[The original answer was posted as two image attachments: 38.png and 39.png.]
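Since the screenshots do not reproduce here, the following is a stand-in sketch (not the original answer): a minimal one-hidden-layer network trained by back-propagation in R. The toy task, network size, learning rate, and iteration count are all arbitrary choices.

set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- matrix(rnorm(100 * 2), 100, 2)            # inputs
y <- as.numeric(x[, 1] * x[, 2] > 0)           # target: an XOR-like pattern
W1 <- matrix(rnorm(2 * 4, sd = 0.5), 2, 4); b1 <- rep(0, 4)
W2 <- rnorm(4, sd = 0.5);                   b2 <- 0
for (step in 1:5000) {
  # forward pass: compute hidden activations and the output
  h    <- sigmoid(x %*% W1 + matrix(b1, 100, 4, byrow = TRUE))
  yhat <- sigmoid(h %*% W2 + b2)
  # backward pass: propagate the output error back layer by layer
  d2 <- (yhat - y) * yhat * (1 - yhat)         # delta at the output
  d1 <- (d2 %*% t(W2)) * h * (1 - h)           # delta at the hidden layer
  # gradient-descent updates (learning rate 0.1, averaged over the batch)
  W2 <- W2 - 0.1 * t(h) %*% d2 / 100; b2 <- b2 - 0.1 * mean(d2)
  W1 <- W1 - 0.1 * t(x) %*% d1 / 100; b1 <- b1 - 0.1 * colMeans(d1)
}
mean((yhat > 0.5) == y)                        # training accuracy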
#13 | Posted by the OP on 2013-7-8 09:29:13
PCA on correlation or covariance?

What are the main differences between performing Principal Components Analysis on a correlation and covariance matrix? Do they give the same results?



You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales. Using the correlation matrix standardises the data.

In general they give different results, especially when the scales differ.

As an example, take a look at the R heptathlon data set. Some of the variables have an average value of about 1.8 (the high jump), whereas other variables (the 200 m) are around 20.
library(HSAUR)
# look at the heptathlon data
heptathlon

# correlations
round(cor(heptathlon[, -8]), 2)   # ignoring "score"
# covariances
round(cov(heptathlon[, -8]), 2)

# PCA
# scale = TRUE bases the PCA on the correlation matrix
hep.PC.cor <- prcomp(heptathlon[, -8], scale = TRUE)
hep.PC.cov <- prcomp(heptathlon[, -8], scale = FALSE)

# PC scores per competitor
hep.scores.cor <- predict(hep.PC.cor)
hep.scores.cov <- predict(hep.PC.cov)

# Plot of PC1 vs PC2
par(mfrow = c(2, 1))
plot(hep.scores.cov[, 1], hep.scores.cov[, 2],
     xlab = "PC 1", ylab = "PC 2", pch = NA, main = "Covariance")
text(hep.scores.cov[, 1], hep.scores.cov[, 2], labels = 1:25)

plot(hep.scores.cor[, 1], hep.scores.cor[, 2],
     xlab = "PC 1", ylab = "PC 2", pch = NA, main = "Correlation")
text(hep.scores.cor[, 1], hep.scores.cor[, 2], labels = 1:25)
Notice that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used.
#14 | Posted by the OP on 2013-7-8 09:34:31
How would you explain Markov Chain Monte Carlo (MCMC) to a layperson?

Maybe the concept, why it's used, and an example.



First, we need to understand what a Markov chain is. Consider the following weather example from Wikipedia. Suppose that the weather on any given day can be classified into only two states, sunny and rainy. Based on past experience, we know the following:

Probability(Next day is sunny | Today is rainy) = 0.50

Since the next day's weather is either sunny or rainy, it follows that:

Probability(Next day is rainy | Today is rainy) = 0.50

Similarly, let:

Probability(Next day is rainy | Today is sunny) = 0.10

Therefore, it follows that:

Probability(Next day is sunny | Today is sunny) = 0.90

The above four numbers can be compactly represented as a transition matrix, which gives the probabilities of the weather moving from one state to another:

            S    R
    P = S [ 0.9  0.1 ]
        R [ 0.5  0.5 ]

We might ask several questions whose answers follow:

Q1: If the weather is sunny today then what is the weather likely to be tomorrow?

A1: Since we do not know for sure what is going to happen, the best we can say is that there is a 90% chance it will be sunny and a 10% chance it will be rainy.

Q2: What about two days from today?

A2: The one-day-ahead prediction is 90% sunny, 10% rainy. Two days from now, either:

the first day is sunny and the second day is also sunny, with probability 0.9 * 0.9;

or

the first day is rainy and the second day is sunny, with probability 0.1 * 0.5.

Therefore, the probability that the weather will be sunny in two days is:

Prob(Sunny two days from now) = 0.9 * 0.9 + 0.1 * 0.5 = 0.81 + 0.05 = 0.86

Similarly, the probability that it will be rainy is:

Prob(Rainy two days from now) = 0.1 * 0.5 + 0.9 * 0.1 = 0.05 + 0.09 = 0.14

If you keep forecasting the weather like this, you will notice that the nth-day forecast, for n large (say 30), eventually settles to the following 'equilibrium' probabilities:

Prob(Sunny) = 0.833, Prob(Rainy) = 0.167

In other words, your forecasts for the nth day and the (n+1)th day are the same. You can also check that these 'equilibrium' probabilities do not depend on today's weather: you get the same forecast whether you start off by assuming that today is sunny or rainy.

The above example only works if the state transition probabilities satisfy several conditions which I will not discuss here. But notice the following feature of such a 'nice' Markov chain (nice = the transition probabilities satisfy those conditions):

irrespective of the initial starting state, we eventually reach an equilibrium probability distribution over states.
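A quick numeric check of that claim in R: iterate the transition matrix from two different starting states and watch both converge to the same distribution.

P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE,
            dimnames = list(c("S", "R"), c("S", "R")))
p.sunny <- c(1, 0)   # today is certainly sunny
p.rainy <- c(0, 1)   # today is certainly rainy
for (day in 1:30) {
  p.sunny <- p.sunny %*% P
  p.rainy <- p.rainy %*% P
}
round(p.sunny, 3)    # 0.833 0.167
round(p.rainy, 3)    # 0.833 0.167 -- the same, regardless of the start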

Markov Chain Monte Carlo exploits this feature as follows:

We want to generate random draws from a target distribution, so we identify a way to construct a 'nice' Markov chain whose equilibrium probability distribution is our target distribution.

If we can construct such a chain, we start from an arbitrary point and iterate the chain many times (just as we forecast the weather above). Eventually, the draws we generate appear as if they are coming from the target distribution.

We then approximate the quantities of interest (e.g. the mean) by the sample average of the draws, after discarding a few initial draws; this is the Monte Carlo component.

There are several ways to construct such 'nice' Markov chains (e.g., the Gibbs sampler and the Metropolis-Hastings algorithm).
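As a flavour of the simplest of these, here is a minimal random-walk Metropolis-Hastings sketch in R; the standard normal target and the unit proposal step are arbitrary illustrative choices.

set.seed(1)
target <- function(x) dnorm(x)        # the distribution we want draws from
n.draws <- 10000
draws <- numeric(n.draws)
x <- 0                                # arbitrary starting point
for (i in 1:n.draws) {
  prop <- x + rnorm(1)                # propose a random-walk move
  if (runif(1) < target(prop) / target(x)) x <- prop  # accept, else stay put
  draws[i] <- x
}
mean(draws[-(1:1000)])                # sample average after burn-in, near 0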
#15 | Posted by the OP on 2013-7-8 10:03:39
What is the best way to identify outliers in multivariate data?

Suppose I have a large set of multivariate data with at least three variables. How can I find the outliers? Pairwise scatterplots won't work as it is possible for an outlier to exist in 3 dimensions that is not an outlier in any of the 2 dimensional subspaces.

I am not thinking of a regression problem, but of true multivariate data. So answers involving robust regression or computing leverage are not helpful.

One possibility would be to compute the principal component scores and look for an outlier in the bivariate scatterplot of the first two scores. Would that be guaranteed to work? Are there better approaches?



I think Robin Girard's answer would work pretty well for 3 and possibly 4 dimensions, but the curse of dimensionality would prevent it working beyond that. However, his suggestion led me to a related approach which is to apply the cross-validated kernel density estimate to the first three principal component scores. Then a very high-dimensional data set can still be handled ok.

In summary, for i = 1 to n:

1. Compute a density estimate of the first three principal component scores obtained from the data set without Xi.
2. Calculate the likelihood of Xi under the density estimated in step 1; call it Li.

Sort the Li (for i = 1, ..., n); the outliers are the points with likelihood below some threshold. I'm not sure what a good threshold would be -- I'll leave that for whoever writes the paper on this! One possibility is to do a boxplot of the log(Li) values and see which outliers are detected at the negative end.
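A rough R sketch of this recipe, assuming the ks package for the three-dimensional kernel density estimate (any multivariate KDE would do); the simulated data, the function name, and the boxplot-based threshold are all illustrative.

library(ks)
outlier.likelihoods <- function(X) {
  pc <- prcomp(X, scale = TRUE)$x[, 1:3]    # first three PC scores
  n  <- nrow(pc)
  L  <- numeric(n)
  for (i in 1:n) {
    fit  <- kde(pc[-i, ])                   # density estimated without point i
    L[i] <- predict(fit, x = pc[i, , drop = FALSE])  # likelihood of point i
  }
  L
}
X <- rbind(matrix(rnorm(500), ncol = 5), rep(5, 5))  # 100 points plus 1 planted outlier
L <- outlier.likelihoods(X)
boxplot(log(L))                             # look for outliers at the negative end
which(log(L) < boxplot.stats(log(L))$stats[1])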
#16 | Posted by the OP on 2013-7-26 16:29:52
How do I order or rank a set of experts?

I have a database containing a large number of experts in a field. For each of those experts I have a variety of attributes/data points, like:

    number of years of experience
    licenses
    number of reviews
    textual content of those reviews
    the 5-star rating on each of those reviews, for a number of factors like speed, quality, etc.
    awards, associations, conferences, etc.

I want to provide a rating for these experts, say out of 10, based on their importance. Some of the data points might be missing for some of the experts. Now my question is: how do I come up with such an algorithm? Can anyone point me to some relevant literature?

I am also concerned that, as with all ratings/reviews, the numbers might bunch up near certain values. For example, most experts might end up getting an 8 or a 5. Is there a way to amplify small differences into larger differences in the score for only some of the attributes?

Some other discussions that I figured might be relevant:
http://stats.stackexchange.com/questions/1848/bayesian-rating-system-with-multiple-categories-for-each-rating

http://stats.stackexchange.com/questions/2689/how-would-you-compute-imdb-movie-rating
http://stats.stackexchange.com/questions/1/eliciting-priors-from-experts
http://stats.stackexchange.com/questions/2563/what-are-some-of-the-best-ranking-algorithms-with-inputs-as-up-and-down-votes




People have invented numerous systems for rating things (like experts) on multiple criteria: visit the Wikipedia page on Multi-criteria decision analysis for a list. Not well represented there, though, is one of the most defensible methods out there: multi-attribute valuation theory. This comprises a set of methods to evaluate trade-offs among sets of criteria in order to (a) determine an appropriate way to re-express values of the individual variables and (b) weight the re-expressed values to obtain a score for ranking. The principles are simple and defensible, the mathematics is unimpeachable, and there's nothing fancy about the theory. More people should know and practice these methods rather than inventing arbitrary scoring systems.
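To make those two steps concrete, here is a toy sketch of an additive multi-attribute score in R. The attributes, weights, rescaling, and missing-data handling are all invented for illustration; in real multi-attribute valuation the value functions and weights are elicited from trade-off judgments, not fixed ad hoc.

experts <- data.frame(
  years   = c(12, 3, 25, 8),
  reviews = c(40, 5, 200, NA),    # missing data points are allowed
  stars   = c(4.2, 3.8, 4.9, 4.5)
)
# (a) re-express each attribute on a common 0-1 value scale
rescale <- function(x) (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
V <- as.data.frame(lapply(experts, rescale))
# (b) weight the re-expressed values, renormalising over observed attributes
w <- c(years = 0.3, reviews = 0.3, stars = 0.4)
score <- apply(V, 1, function(v) {
  ok <- !is.na(v)
  10 * sum(w[ok] * v[ok]) / sum(w[ok])   # score out of 10
})
rank(-score)                             # 1 = highest-rated expert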
#17 | Posted by the OP on 2013-7-26 16:32:38
Is adjusting p-values in a multiple regression for multiple comparisons a good idea?

Let's assume you are a social science researcher/econometrician trying to find relevant predictors of demand for a service. You have 2 outcome/dependent variables describing the demand (using the service yes/no, and the number of occasions). You have 10 predictor/independent variables that could theoretically explain the demand (e.g., age, sex, income, price, race, etc.). Running two separate multiple regressions yields 20 coefficient estimates and their p-values. With enough independent variables in your regressions, you will sooner or later find at least one variable with a statistically significant association between the dependent and independent variables.

My question: is it a good idea to correct the p-values for multiple tests if I want to include all the independent variables in the regression? Any references to prior work are much appreciated.



It seems your question more generally addresses the problem of identifying good predictors. In that case, you should consider using some kind of penalized regression (methods dealing with variable or feature selection are relevant too), with e.g. L1 or L2 penalties, or a combination thereof, the so-called elastic net (look for related questions on this site, or the R penalized and elasticnet packages, among others).
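For the penalized route, a minimal sketch on simulated data, using the glmnet package here purely for brevity (the penalized and elasticnet packages mentioned above serve the same purpose):

set.seed(1)
library(glmnet)
n <- 200
x <- matrix(rnorm(n * 10), n, 10)      # 10 hypothetical predictors
y <- 2 * x[, 1] - 1.5 * x[, 4] + rnorm(n)
cvfit <- cv.glmnet(x, y, alpha = 0.5)  # alpha = 0.5 mixes L1 and L2 (elastic net)
coef(cvfit, s = "lambda.1se")          # irrelevant predictors are shrunk to zero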

Now, about correcting p-values for your regression coefficients (or, equivalently, your partial correlation coefficients) to protect against over-optimism (e.g., with Bonferroni or, better, step-down methods): this seems relevant only if you are considering one model and seek the predictors that contribute a significant part of the explained variance, that is, if you don't perform model selection (with stepwise selection or hierarchical testing). This article may be a good start: Bonferroni Adjustments in Tests for Regression Coefficients. Be aware that such a correction won't protect you against multicollinearity, which affects the reported p-values.

Given your data, I would recommend using some kind of iterative model selection technique. In R, for instance, the stepAIC function allows you to perform stepwise model selection by exact AIC. You can also estimate the relative importance of your predictors based on their contribution to R2 using the bootstrap (see the relaimpo package). I think that reporting an effect size measure or the % of explained variance is more informative than a p-value, especially in a confirmatory model.
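A small sketch of both suggestions on simulated data (the data-generating process and variable names are invented):

set.seed(1)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(X) <- paste0("x", 1:10)
X$y <- 2 * X$x1 - 1.5 * X$x4 + rnorm(n)   # only x1 and x4 actually matter

library(MASS)                          # provides stepAIC
full <- lm(y ~ ., data = X)
step <- stepAIC(full, trace = FALSE)   # stepwise model selection by AIC
summary(step)

library(relaimpo)                      # relative importance with bootstrap CIs
booteval.relimp(boot.relimp(full, b = 200), sort = TRUE)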

It should be noted that stepwise approaches also have their drawbacks (e.g., Wald tests are not adapted to the conditional hypotheses induced by the stepwise procedure). As Frank Harrell put it on the R mailing list, "stepwise variable selection based on AIC has all the problems of stepwise variable selection based on P-values. AIC is just a restatement of the P-value" (though AIC remains useful if the set of predictors is already defined). A related question -- Is a variable significant in a linear regression model? -- raised interesting comments (from @Rob, among others) about the use of AIC for variable selection. I append a couple of references at the end (including papers kindly provided by @Stephan); there are also many other references on P.Mean.

Frank Harrell authored a book, Regression Modeling Strategies, which includes a lot of discussion and advice on this problem (§4.3, pp. 56-60). He also developed efficient R routines for dealing with generalized linear models (see the Design and rms packages). So I think you definitely have to take a look at it (his handouts are available on his homepage).
#18 | Posted by the OP on 2013-7-26 16:36:30
How can I test the fairness of a d20?
How can I test the fairness of a twenty-sided die (d20)? Obviously I would be comparing the distribution of values against a uniform distribution. I vaguely remember using a chi-square test in college. How can I apply this to see whether a die is fair?



Here's an example with R code. The output is preceded by #'s. A fair die:
rolls <- sample(1:20, 200, replace = TRUE)
table(rolls)
#rolls
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 7  8 11  9 12 14  9 14 11  7 11 10 13  8  8  5 13  9 10 11
chisq.test(table(rolls), p = rep(0.05, 20))

#         Chi-squared test for given probabilities
#
# data:  table(rolls)
# X-squared = 11.6, df = 19, p-value = 0.902
A biased die - numbers 1 to 10 each have probability 0.045, and numbers 11 to 20 each have probability 0.055 - with 200 throws:
rolls <- sample(1:20, 200, replace = TRUE,
                prob = c(rep(0.045, 10), rep(0.055, 10)))
table(rolls)
#rolls
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 8  9  7 12  9  7 14  5 10 12 11 13 14 16  6 10 10  7  9 11
chisq.test(table(rolls), p = rep(0.05, 20))

#        Chi-squared test for given probabilities
#
# data:  table(rolls)
# X-squared = 16.2, df = 19, p-value = 0.6439
We have insufficient evidence of bias (p = 0.64).

A biased die, 1000 throws:
rolls <- sample(1:20, 1000, replace = TRUE,
                prob = c(rep(0.045, 10), rep(0.055, 10)))
table(rolls)
#rolls
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 42 47 34 42 47 45 48 43 42 45 52 50 57 57 60 68 49 67 42 63
chisq.test(table(rolls), p = rep(0.05, 20))

#        Chi-squared test for given probabilities
#
# data:  table(rolls)
# X-squared = 32.36, df = 19, p-value = 0.02846
Now p < 0.05 and we are starting to see evidence of bias. You can use similar simulations to estimate the level of bias you can expect to detect, and the number of throws needed to detect it at a given significance level.
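Along those lines, a quick power simulation for the bias pattern above; the replicate count is arbitrary, and factor() guards against throw counts small enough to leave some faces unobserved.

power.sim <- function(n.throws, n.sim = 1000) {
  p.biased <- c(rep(0.045, 10), rep(0.055, 10))
  rejected <- replicate(n.sim, {
    rolls <- sample(1:20, n.throws, replace = TRUE, prob = p.biased)
    chisq.test(table(factor(rolls, levels = 1:20)), p = rep(0.05, 20))$p.value < 0.05
  })
  mean(rejected)   # proportion of simulations that detect the bias
}
power.sim(200)     # low power with 200 throws
power.sim(1000)    # considerably higher with 1000 throws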

Wow, 2 other answers even before I finished typing.
#19 | Posted by the OP on 2013-7-26 16:40:16
Dealing with missing data due to variable not being measured over initial period of a study


I was recently consulting for a researcher in the following situation.

Context:

    data were collected over four years at around 50 participants per year (participants had a specific diagnosed clinical psychology disorder and were difficult to obtain in large numbers); participants were measured only once (i.e., it is not a longitudinal study)
    all participants had the same disorder
    the study involved participants completing a set of 10 psychological scales
    the 10 scales measured various things like symptoms, theorised precursors, and related psychopathology; the measures tended to intercorrelate around r = .3 to .7
    in the first year one of the scales was not included
    the researcher wanted to run structural equation modelling on all 10 scales on the entire sample; thus, around a quarter of the sample had missing data on one scale

The researcher wanted to know:

    What is a good strategy for dealing with missing data like this? What tips, references to applied examples, or references to advice regarding best practice would you suggest?

I had a few thoughts, but I was keen to hear your suggestions.



I like Manski's partial identification approach to missing data. The basic idea is to ask: given all the possible values the missing data could take, what is the set of values the estimated parameters could take? This set might be very large, in which case you could consider restricting the distribution of the missing data. Manski has a number of papers and a book on this topic. This short paper is a good overview.
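As a toy illustration of the idea (not Manski's estimators themselves): for an outcome bounded on a known scale, here rescaled to [0, 1], worst-case bounds on the mean come from imputing all missing values at the two extremes.

x <- c(0.2, 0.8, NA, 0.5, NA, 0.9)    # hypothetical bounded scores, 2 missing
lo <- mean(ifelse(is.na(x), 0, x))    # every missing value at the minimum
hi <- mean(ifelse(is.na(x), 1, x))    # every missing value at the maximum
c(lower = lo, upper = hi)             # the identified set for the mean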

Inference in partially identified models can be complicated and is an active area of research. This review (ungated pdf) is a good place to get started.

#20 | Posted by the OP on 2013-7-26 16:43:17
A good way to show lots of data graphically


I'm working on a project that involves 14 variables and 345,000 observations for housing data (things like year built, square footage, price sold, county of residence, etc). I'm concerned with trying to find good graphical techniques and R libraries that contain nice plotting techniques.

I'm already exploring what in ggplot and lattice will work nicely, and I'm thinking of violin plots for some of my numerical variables.

What other packages would people recommend for displaying a large number of numerical or factor-type variables in a clear, polished, and, most importantly, succinct manner?



The best "graph" is so obvious nobody has mentioned it yet: make maps. Housing data depend fundamentally on spatial location (according to the old saw about real estate), so the very first thing to be done is to make a clear detailed map of each variable. To do this well with a third of a million points really requires an industrial-strength GIS, which can make short work of the process. After that it makes sense to go on and make probability plots and boxplots to explore univariate distributions, and to plot scatterplot matrices and wandering schematic boxplots, etc, to explore dependencies--but the maps will immediately suggest what to explore, how to model the data relationships, and how to break up the data geographically into meaningful subsets.