There are many kinds of competitive events that we may be interested in modeling, for instance election campaigns, horse and greyhound races, various athletic/track-and-field events, etc. In each of these competitive events, for each contestant we would...

From: Stats Stack Exchange | By: gsq10 | Friday, October 21, 2016

On the site "Text Classification for Sentiment Analysis – Eliminate Low Information Features", low-information features are identified by using the BigramAssocMeasures.chi_sq function of NLTK. My question: What does $\chi^2$ have to do with this? How...

From: Stats Stack Exchange | By: user3813234 | Thursday, October 20, 2016

I'm interested in kernel bootstrap (adding random noise drawn from kernel densities to bootstrap samples, to obtain "smoothed" bootstrap samples). Can you provide references dealing with it? Silverman (1986) briefly describes it, but I'd be looking for...

From: Stats Stack Exchange | By: Tim | Saturday, October 22, 2016
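
For readers unfamiliar with the kernel (smoothed) bootstrap described in the question above, here is a minimal sketch. It assumes a Gaussian kernel and a user-supplied bandwidth; the function name `smoothed_bootstrap` and its parameters are illustrative, not taken from Silverman (1986) or any particular library.

```python
import numpy as np

def smoothed_bootstrap(x, n_boot, bandwidth, rng=None):
    """Draw smoothed bootstrap samples: resample with replacement,
    then add Gaussian kernel noise with the given bandwidth."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    # Ordinary bootstrap step: indices drawn with replacement.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    # Smoothing step: noise drawn from the (assumed Gaussian) kernel.
    noise = rng.normal(0.0, bandwidth, size=(n_boot, x.size))
    return x[idx] + noise

# Hypothetical data, for illustration only.
data = np.array([1.2, 0.7, 3.4, 2.2, 1.9])
samples = smoothed_bootstrap(data, n_boot=1000, bandwidth=0.3, rng=0)
```

With `bandwidth=0` this reduces to the ordinary bootstrap; how to choose the bandwidth is exactly the kind of detail the requested references would have to settle.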

I would like to write code to forecast the status. Status 0 means non-active and 1 means active. I would like to predict whether the status in a future month (e.g. 2016/6/1) will be "0" or "1". What algorithm could be used in such a situation? date status...

From: Stats Stack Exchange | By: illy | Saturday, October 22, 2016

If I have XBar = 0.14 and Ybar = 0.139, the population SD of X is 0.0026, and the population SD of Y is 0.0024, how can I construct a test to see if I can assume equal population standard deviations? The population for both statistics is 6 so here is...

From: Stats Stack Exchange | By: Itachi San | Saturday, October 22, 2016

I am conducting research and using SARIMA models in my analysis. I have transformed my data to achieve stationarity (first difference and seasonal difference) and my ACF and PACF are as below. However, I have challenges in identifying the values of p,q,...

From: Stats Stack Exchange | By: Ocy | Saturday, October 22, 2016

So I need to create a permutation distribution of the difference in the proportions for a data set, however I'm not sure of the best way to go about doing so. This is the table that I need it for. I have to assess whether the difference...

From: Stats Stack Exchange | By: gg2525 | Saturday, October 22, 2016
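
One common way to build the permutation distribution asked about above is to pool the binary outcomes, shuffle them, and recompute the difference in proportions each time. The sketch below assumes two independent groups; the counts used are hypothetical, since the question's table is not shown here.

```python
import numpy as np

def perm_diff_in_proportions(succ_a, n_a, succ_b, n_b, n_perm=10000, seed=0):
    """Permutation distribution of the difference in proportions (A minus B)."""
    rng = np.random.default_rng(seed)
    # Pool all outcomes: 1 = success, 0 = failure.
    pooled = np.zeros(n_a + n_b, dtype=int)
    pooled[: succ_a + succ_b] = 1
    observed = succ_a / n_a - succ_b / n_b
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)  # random relabelling of the groups
        diffs[i] = perm[:n_a].mean() - perm[n_a:].mean()
    # Two-sided p-value: share of permuted differences at least as extreme.
    p_value = np.mean(np.abs(diffs) >= abs(observed) - 1e-12)
    return observed, diffs, p_value

# Hypothetical counts: 18/30 successes in group A, 9/30 in group B.
observed, diffs, p_value = perm_diff_in_proportions(18, 30, 9, 30, n_perm=2000)
```

A histogram of `diffs` is the permutation distribution; `p_value` summarizes where the observed difference falls in it.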

I'm using RStudio v0.99.903 on Windows 10 Home edition. When using pca with the thresh parameter in the preProcess function of R's caret package, it doesn't respect my threshold value and keeps building the number of principal components that explain...

From: Stats Stack Exchange | By: Wendy De Wit | Saturday, October 22, 2016

I planted about 300 different genotypes of plants each of them cloned three times in a randomized block design (3 blocks). I have measured 15 phenotypic traits. Before the experiment started, each lineage was classified either as "invasive" or as "non-invasive"....

From: Stats Stack Exchange | By: Remi.b | Saturday, October 22, 2016

Are the two concepts really two sides of the same coin? The latter is often referred to simply as a regression, but surely this is just an unfortunate coincidence? The former is about predicting values of a dependent variable from a weighted combination...

From: Stats Stack Exchange | By: wildetudor | Saturday, October 22, 2016

We run a 3-NN classifier.
We are looking at a case in which a query point has 5 neighbors (or some number greater than K, which is 3 at the moment) at equal distance.
What can we do in this case?

From: Stats Stack Exchange | By: CaTx | Saturday, October 22, 2016
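
One possible way to handle the tie described above is to let every point tied at the K-th smallest distance vote, rather than picking 3 of them arbitrarily. The sketch below illustrates that single option (random tie-breaking or distance weighting are alternatives); the function name and the tolerance `tol` are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict_with_ties(X_train, y_train, query, k=3, tol=1e-9):
    """k-NN vote that includes all points tied at the k-th distance."""
    X_train = np.asarray(X_train, dtype=float)
    d = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    kth = np.sort(d)[k - 1]
    # Keep every point no farther than the k-th neighbour (up to tol).
    neighbors = np.flatnonzero(d <= kth + tol)
    votes = Counter(np.asarray(y_train)[neighbors].tolist())
    # If the class counts themselves tie, most_common returns the class seen first.
    return votes.most_common(1)[0][0], neighbors

# Hypothetical data: five points at distance 1 from the origin, one far away.
X = [[1, 0], [-1, 0], [0, 1], [0, -1], [0.6, 0.8], [5, 5]]
y = ["a", "a", "a", "b", "b", "b"]
label, used = knn_predict_with_ties(X, y, query=[0, 0], k=3)
```

Here all five equidistant points vote, so the prediction does not depend on the order of the training data.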

Imagine the following situation: We need to review certain proposals because we have room for only $n$ out of $n_{tot}$ of them. To perform the review we have $r$ independent reviewers that are going to score each proposal. Each reviewer is provided with a...

From: Stats Stack Exchange | By: Dargor | Saturday, October 22, 2016

I have a basic question regarding approaches to model averaging using IT criteria to weight models within a candidate set. Most sources that I have read on model averaging advocate averaging the parameter coefficient estimates based on model weights...

From: Stats Stack Exchange | By: John Stella | Saturday, October 22, 2016

Basically a modelling exercise, the problem can be stated as follows: Suppose there are $N$ Bernoulli trials, $v_n$, with probability $p_n$ of heads. Suppose further there is for each Bernoulli trial a Bernoulli "signal" about the outcome of the trial....

From: Stats Stack Exchange | By: ZMI | Saturday, October 22, 2016

I'm building an algorithm in R to calculate the Shapley Value for players in a collaborative game. However, I do not have an outcome value for all possible coalitions, partially because the number of players is relatively high (in the 100s/1000s), and...

From: Stats Stack Exchange | By: Andy C | Saturday, October 22, 2016

I don't believe that (Why does increasing the sample size lower the variance?) appropriately handles my question! The linked question explains why any addition of random variables (all iid) produces a new RV whose variance is less than the variance...

From: Stats Stack Exchange | By: Muno | Saturday, October 22, 2016

In R package mgcv, is it possible to explicitly force smooths to be centered? I have a term of type s(x,by=z) with z a numeric vector, so the smooth has an intercept by default. Note: The package manual only mentions how to do so for customized smooth...

From: Stats Stack Exchange | By: yannick | Saturday, October 22, 2016

Assume a simple regression model, $y = \beta_0 + x\beta_1 + u$. I decide to change the units of measurement for the explanatory variable and the response variable. Do the $\beta_0$ and $\beta_1$ parameters change as well? I assume that the $\beta$ parameters...

From: Stats Stack Exchange | By: user358065 | Saturday, October 22, 2016
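
The effect of changing units in the simple regression above can be checked numerically. If $x$ is multiplied by $a$ and $y$ by $b$, the OLS slope becomes $(b/a)\hat\beta_1$ and the intercept becomes $b\hat\beta_0$. The sketch below uses simulated data; the conversion factors `a` and `b` and the helper `fit_line` are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """OLS fit of y = b0 + b1 * x; returns (b0, b1)."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

a, b = 100.0, 0.001  # hypothetical unit conversions for x and y

b0, b1 = fit_line(x, y)
b0_s, b1_s = fit_line(a * x, b * y)  # refit after rescaling both variables
```

Comparing `b1_s` with `(b / a) * b1` and `b0_s` with `b * b0` shows that the estimates change only by the known scale factors; the fit itself is unchanged.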

Specifically, I'm looking at 3 different commodity futures prices. I want to test that each has a unit root and that they are cointegrated. Next, I'd like to apply a model to them to test if they (or a combination of them) are mean reverting. What I want...

From: Stats Stack Exchange | By: ralcorn | Saturday, October 22, 2016

There is this question in a problem set I have to solve and I'm really confused. Please help me or give me some hints how I could solve the problem. The question is as follows: Let $(X_1,X_2,...,X_n)$ be a random sample of $n$ observations from a normal...

From: Stats Stack Exchange | By: JohnBlack | Saturday, October 22, 2016

I have read some tutorials and looked at some of the questions here as well, but I am still unsure about the questions I ask below. I would really appreciate it if you could help me clarify them, if possible with concrete examples. If I understand it correctly,...

From: Stats Stack Exchange | By: Pegah | Saturday, October 22, 2016

I am trying to calculate the autocorrelation of this process: $$ x(n)=\sum_{l=1}^{L} A_l \sin(2\pi fn+\theta_l)+\text{w}(n) $$ where $\theta_l \sim \text{unif}(0, 2\pi)$ and $\text{w}(n)$ is uncorrelated white noise of variance $\sigma_w^2$. I get stuck...

From: Stats Stack Exchange | By: jagjordi | Saturday, October 22, 2016

I am writing a report about statistical fallacies. I cannot discuss every one of the hundreds that exist, so I would like to pick maybe ten of the most important. Please would someone suggest a (somewhat) rigorous way of identifying which fallacies are...

From: Stats Stack Exchange | By: user135762 | Saturday, October 22, 2016

I am currently trying to classify clothes for my final project in school. My problem is that after I gathered more data, to counteract overfitting, the validation accuracy dropped from 60% to 45%. Below I explain in detail what I did. I use the following...

From: Stats Stack Exchange | By: 0xCOFFEED00D | Saturday, October 22, 2016

I have a binomial random variable $X \sim B(n,p)$. Is there any closed form or upper bound for the variance of the absolute deviation $|X-EX|$?

From: Stats Stack Exchange | By: K. Lakshmanan | Saturday, October 22, 2016
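
A simple bound for the question above follows from $E\left[|X-EX|^2\right] = \mathrm{Var}(X)$, which gives $\mathrm{Var}(|X-EX|) = np(1-p) - \left(E|X-EX|\right)^2 \le np(1-p)$. The sketch below evaluates this identity exactly from the binomial pmf; the function name is illustrative.

```python
from math import comb

def abs_dev_variance(n, p):
    """Exact Var(|X - EX|) for X ~ Binomial(n, p), plus the bound np(1-p)."""
    mean = n * p
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    # Mean absolute deviation E|X - EX|, summed over the support.
    mad = sum(w * abs(k - mean) for k, w in enumerate(pmf))
    var_x = n * p * (1 - p)
    # Var|X - EX| = E[(X - EX)^2] - (E|X - EX|)^2
    return var_x - mad**2, var_x

variance, bound = abs_dev_variance(10, 0.3)
```

The bound is not tight in general: for `n=1, p=0.5` the absolute deviation is constant, so its variance is zero while `np(1-p)` is 0.25.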

When we don't have an a priori model for the dependent variable, how do we understand the significance of an error term?

From: Stats Stack Exchange | By: DS R | Saturday, October 22, 2016

I'm somewhat of a novice in applied stats, so please bear with me if this question looks trivial. Consider the following setup, where a class of n=282 students took three examinations followed by two novel pedagogical interventions I1 and I2. The second column...

From: Stats Stack Exchange | By: Arnold Klein | Saturday, October 22, 2016

I am trying to assess a model's prediction performance, and one metric I look at is the percentage of new observations that fall within the 95% prediction interval, and see whether or not it is actually close to 95%. If the percentage of new observations...

From: Stats Stack Exchange | By: Laura Lels | Friday, October 21, 2016

I have individual level whole population registry data where each observation describes a person id receiving a certain diagnosis x over a ten year period. The nature of x is such that each individual is very likely to receive it at least once each year...

From: Stats Stack Exchange | By: user6571411 | Saturday, October 22, 2016

I'm trying to understand the OLS model and the assumptions behind it. I'm struggling between 2 texts with a different approach: nyu do not talk about likelihood estimation theory at all. They develop the Gauss-Markov theorem with a minimum amount of...

From: Stats Stack Exchange | By: ihadanny | Saturday, October 22, 2016

I'm using the Naive Bayes algorithm to classify movie reviews (positive and negative). I tried to eliminate stop words (using a stop word list) before running the algorithm, then I realized it led to worse accuracy than not doing that. My question is if stop...

From: Stats Stack Exchange | By: Trung Lê Hoàng | Saturday, October 22, 2016

I have hourly temperature and power consumption data of several days of a month. The pattern is almost similar across days like this: Using this data I want to predict the usage of a coming day. I have features : 1) hour of the day 2) temperature; and...

From: Stats Stack Exchange | By: Haroon Rashid | Saturday, October 22, 2016

Normally within the model block I might specify a prior on a parameter with y ~ normal(mu, sigma); But what if I already have a posterior on y from a previous analysis, and want to use that posterior as the prior in the new analysis? Can I import a set...

From: Stats Stack Exchange | By: osazuwa | Saturday, October 22, 2016

Let's say I have the following regression specification: model <- glm(income ~ education + age + married + race + left + quartile_health, family=binomial, data=data) --If I wanted to compute the marginal effect of left (a binary variable taking 1...

From: Stats Stack Exchange | By: Parseltongue | Saturday, October 22, 2016

I am using dnbinom() for writing the log-likelihood function and then estimating parameters using the mle2() {bbmle} function in R. The problem is that I got 16 warnings, all of them NaNs produced like this one: 1: In dnbinom(yobs, mu = mu_j, size = k, log...

From: Stats Stack Exchange | By: Sergio Nolazco | Saturday, October 22, 2016

I am performing an LDA with very unequal sample sizes (1:10) among 3 groups. The results surprise me, as I was expecting from a series of boxplots that some variables that explain much of the between-group variance would show up on at least one of the first...

From: Stats Stack Exchange | By: Remi.b | Saturday, October 22, 2016

In logistic regression, can I still rely on the odds ratio even if the p-value is not significant? For example, p = 0.140 and OR = 2.834.
Thanks.

From: Stats Stack Exchange | By: Ali | Saturday, October 22, 2016

I'm trying to simulate outcomes in a zero-inflated negative binomial model. My model gives an estimate of the dispersion parameter theta, as well as a standard error for log(theta). My question is, what is the uncertainty distribution around theta?...

From: Stats Stack Exchange | By: EdSeab | Saturday, October 22, 2016

Let $T = X(Y/n)^{-1/2}$ with $X \sim N \left(0,1 \right)$ and $Y \sim Gamma \left(\frac{n}{2},\frac{1}{2}\right)$ with $ n \geq 3$. The Gamma and Normal distributions are independent. Gamma using the following parametrization: $f(x;k,\theta) = \frac{\theta^{k}x^{k-1}e^{-\theta...

From: Stats Stack Exchange | By: mechanical_fan | Friday, October 21, 2016

I have several features I'd like to use for computing cosine similarity between rows in a data set. However, two of them are latitude and longitude. Apart from the fact that it's not the "correct" way to measure the distance between points on the surface...

From: Stats Stack Exchange | By: ssdecontrol | Friday, October 21, 2016

Is there an advantage to using bagging with k-NN? I constantly get better performance while doing it. Can it be that it is because of resampling with repeated instances, which will therefore be classified in the same neighborhood? Is this some kind of...

From: Stats Stack Exchange | By: user2883596 | Friday, October 21, 2016

I want to cluster a set of items, and have identified hierarchical cluster analysis as a means of doing this, but due to the methodology used to collect the data only a distance/dissimilarity matrix is available (I can elaborate on why, if necessary)....

From: Stats Stack Exchange | By: Ian_Fin | Friday, October 21, 2016

When plotting the principal components over the original data, why is the plot correct when PCA uses the correlation matrix and incorrect when it uses the covariance matrix? I have a data matrix d and I subtract the column means and put them in the means...

From: Stats Stack Exchange | By: user3022875 | Friday, October 21, 2016

So basically I know two SSRs, and n. How do I apply an F test to them? I am thinking about dividing the difference of the SSRs by Var?...

From: Stats Stack Exchange | By: J.doe | Friday, October 21, 2016

As in the subject. If you had half a page to explain dropout, how would you proceed? What is the rationale behind it?

From: Stats Stack Exchange | By: Davide C | Friday, October 21, 2016

I am having a problem with the density of the first order statistic of a series of n iid random variables with a common distribution (standard normal). I am using Arnold's book as a reference for such a density function for the k'th order stat: $$ f_{X_{k:n}}(x)...

From: Stats Stack Exchange | By: mrb | Friday, October 21, 2016

I'm new to statistics and, reading stats books, I found different definitions for variance. Definition 1: $ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $ Definition 2: $ s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 $ I would like to know why is...

From: Stats Stack Exchange | By: Jon | Friday, October 21, 2016
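
The two definitions in the question above differ only in the divisor, and NumPy exposes that choice through the `ddof` argument of `np.var`. The sketch below computes both on a small made-up data set.

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Definition 1: divide by n - 1 (the unbiased sample variance).
var_n_minus_1 = np.var(x, ddof=1)

# Definition 2: divide by n (NumPy's default, ddof=0).
var_n = np.var(x, ddof=0)

# The two differ by the factor (n - 1) / n, which approaches 1 as n grows.
ratio = var_n / var_n_minus_1
```

For this data the sum of squared deviations is 32, so Definition 2 gives 32/8 = 4 and Definition 1 gives 32/7.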

I think that I am looking for everything outside $B$. So why is the answer $23/30$ instead of $2/3$? Thanks

From: Stats Stack Exchange | By: Lanous | Friday, October 21, 2016

I'm currently trying to complete the book 'Introduction to Statistical Learning' by James et al. and I'm stuck in one of the exercises, trying to get some logic out of the results. The question is this one (see in line): I collect a set of data (n =...

From: Stats Stack Exchange | By: cimentadaj | Friday, October 21, 2016

I just want to ask a question about notation in this exercise. In the equation $X^*_{.,i} =M_{X.,-i} * X_i$: $X^*_{.,i}$ means the ith column of the original matrix; $M_{X.,-i}$ means the orthogonal projection matrix of the column space $X^*_{.,-i}$ (every column except...

From: Stats Stack Exchange | By: Daniel Yefimov | Friday, October 21, 2016

