I have two columns that contain investment bank names in joint debt offerings. I want to create a dummy variable to indicate whether Bank 1 and Bank 2 have a recurring relationship. Ex: if Lehman Brothers & Goldman Sachs or Goldman Sachs & Lehman Brothers...

From: Stats Stack Exchange | By: user3743720 | Friday, November 28, 2014
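
For the bank-pair question above, one possible approach is to build an order-independent key for each row, so that (Lehman Brothers, Goldman Sachs) and (Goldman Sachs, Lehman Brothers) count as the same relationship, and then flag pairs that appear more than once. The sketch below is only an illustration: the function name and the sample rows are hypothetical, and it assumes "recurring" means the unordered pair occurs in at least two offerings.

```python
from collections import Counter

def recurring_pair_dummy(bank1, bank2):
    """Return a 0/1 list: 1 if the unordered (bank1, bank2) pair occurs more than once."""
    # frozenset ignores order, so (A, B) and (B, A) map to the same key
    keys = [frozenset((a, b)) for a, b in zip(bank1, bank2)]
    counts = Counter(keys)
    return [1 if counts[k] > 1 else 0 for k in keys]

# Hypothetical example rows (not taken from the question's data)
bank1 = ["Lehman Brothers", "Goldman Sachs", "Morgan Stanley"]
bank2 = ["Goldman Sachs", "Lehman Brothers", "Goldman Sachs"]
dummy = recurring_pair_dummy(bank1, bank2)
print(dummy)
```

If "recurring" should instead mean "the pair has worked together before this deal", the count would have to be accumulated in date order rather than over the whole table.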

I'm fitting some machine learning algorithms (e.g. SVM) on my panel data. It's taking too long for my entire dataset, so I'm considering generating smaller samples by bootstrapping and then fitting the SVM in parallel. At first glance, it may seem like by...

From: Stats Stack Exchange | By: Heisenberg | Wednesday, November 26, 2014

I have a database with a binary response variable and 100 predictors (correlated and uncorrelated). I want to try the machine learning techniques in R I've been reading about in the last 3 weeks (ridge, lasso, decision tree, boosting, random forest,...

From: Stats Stack Exchange | By: lorelai | Wednesday, November 26, 2014

I would like to know the formula for the deviance of the exponential distribution (and the proof if possible, but no worries if not). Can anyone please suggest an online source for that?

From: Stats Stack Exchange | By: Günal | Friday, November 28, 2014

I was told that if it is reasonable for a linear regression to go through the origin, one should force it to. For example, we expect that mass is proportional to volume. No volume should mean no mass, so we have an extra data point. But I've read that...

From: Stats Stack Exchange | By: jinawee | Friday, November 28, 2014

How can I create this horizontal line showing a two-group comparison in a box plot with several groups: http://openi.nlm.nih.gov/imgs/512/318/2699850/2699850_zdb0070957700001.png Apparently the boxplot function does not have an option to produce this two group...

From: Stats Stack Exchange | By: arkiaamu | Friday, November 28, 2014

I have read that the 2SLS estimator is still consistent even with a binary endogenous variable (http://www.stata.com/statalist/archive/2004-07/msg00699.html). In the first stage, a probit treatment model will be run instead of a linear model. Is there any...

From: Stats Stack Exchange | By: Vincent | Friday, November 28, 2014

I'm currently implementing the M-test of Fuchs and Kenett. The goal is to determine whether a random vector $n$ could have been generated by a given multinomial distribution $p^{(0)}$. My question is whether my understanding of the formulas (converted...

From: Stats Stack Exchange | By: Omega | Friday, November 28, 2014

I am trying to construct a hypothesis test using R. I want a significance level of about 0.05. How can I find the critical value that I should choose? I know that I can use pbinom to find the significance level given the critical value but...

From: Stats Stack Exchange | By: user189013 | Thursday, November 27, 2014

I construct a permutation test to see whether two samples come from the same distribution. I have two vectors $x,y$ that hold values sampled from two populations and the test statistic $mean(x)-mean(y)$. I am given a p-value and...

From: Stats Stack Exchange | By: user189013 | Friday, November 28, 2014
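
The permutation test described in the question above, with the difference in means as the test statistic, can be sketched as follows. This is a minimal illustration, not the asker's code: the function name, the number of permutations, the seed, and the sample values are all assumptions, and the p-value here is two-sided.

```python
import random

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the statistic mean(x) - mean(y)."""
    rng = random.Random(seed)
    observed = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        px, py = pooled[:n_x], pooled[n_x:]
        stat = sum(px) / len(px) - sum(py) / len(py)
        if abs(stat) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical samples
x = [5.1, 4.9, 6.2, 5.8, 6.0]
y = [4.8, 5.0, 5.2, 4.7, 4.9]
p_value = permutation_p_value(x, y)
print(p_value)
```

For very small samples, enumerating all possible splits gives the exact permutation distribution instead of a Monte Carlo estimate.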

As a newbie in probability, I have recently been clarifying my understanding of the Gaussian distribution. I know that if $X$ and $Y$ are jointly Gaussian, then $aX+bY$ ($a$ and $b$ are both constant) is also Gaussian. If $X$ and $Y$ are Gaussian and independent,...

From: Stats Stack Exchange | By: Farticle Pilter | Friday, November 28, 2014

I've got a tricky computational statistics problem and I was wondering if anyone could help me solve it. Okay, so in your left pocket is a penny and in your right pocket is a dime. On a fair toss, the probability of showing a head is p for the penny...

From: Stats Stack Exchange | By: user3457834 | Friday, November 28, 2014

An event has an expected probability of 0.4. After 100 trials, the event occurred 30 times. How can the p-value be computed in Excel?

From: Stats Stack Exchange | By: BSalita | Thursday, November 27, 2014
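
For the question above (30 occurrences in 100 trials with a hypothesized probability of 0.4), an exact binomial test is one reasonable choice. The sketch below uses Python rather than Excel, purely to make the arithmetic explicit. The function names are hypothetical, and the two-sided rule used here (summing all outcomes no more likely than the observed one) is one common convention among several. The lower-tail value corresponds to what Excel's `BINOM.DIST(30, 100, 0.4, TRUE)` returns.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_binomial_p_values(k, n, p):
    """Return (lower-tail p-value, two-sided p-value) for k successes in n trials."""
    lower_tail = sum(binom_pmf(i, n, p) for i in range(k + 1))
    p_obs = binom_pmf(k, n, p)
    # Two-sided: add up every outcome that is no more likely than the observed one.
    # The small relative tolerance guards against ties lost to rounding.
    two_sided = sum(
        pm for i in range(n + 1)
        if (pm := binom_pmf(i, n, p)) <= p_obs * (1 + 1e-7)
    )
    return lower_tail, two_sided

lower_tail, two_sided = exact_binomial_p_values(30, 100, 0.4)
print(lower_tail, two_sided)
```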

I'm learning about decision trees and random forests. But there is something I don't really understand. I have a training set and a cross-validation set. I need to train different random forests, each with a different number of trees. For each forest,...

From: Stats Stack Exchange | By: JN11 | Thursday, November 27, 2014

This question was previously asked on the Quantitative Finance site (http://quant.stackexchange.com/questions/15649/how-to-calculate-the-standard-deviation-of-a-deviation-from-a-moving-average), but I didn't have much luck there, so I am posting it here. Say...

From: Stats Stack Exchange | By: Yugmorf | Friday, November 28, 2014

In my previous question, "Density function for AR model", the density function of the AR model has the variance-covariance matrix given as $\sigma^2 V_p$. In the multivariate Gaussian distribution, the pdf contains no $\sigma^2$ term. I have conceptual questions...

From: Stats Stack Exchange | By: Ria George | Friday, November 28, 2014

There is a relation $p = \sum_{i=1}^\infty z_i$. From the pdf of $z$, how do I obtain the pdf of $p$? The pdf of $z$ is Gaussian, but I don't know how to work with the summation. I am aware of the convolution rule - I have 3 random variables...

From: Stats Stack Exchange | By: Srishti M | Friday, November 28, 2014

A = {class1, class2, class1}
B = {4.0,2.0,5.0}
A is a set of classes while B is a set of ordinal data. How do I calculate the correlation between the two sets?

From: Stats Stack Exchange | By: Amith | Thursday, November 27, 2014

I'm working on an online MOOC. I would like students to share their original text data for further analysis using data mining techniques without concern that their usernames will be preserved in the data set. Even if a string could be resolved back to the user...

From: Stats Stack Exchange | By: xtian | Thursday, November 27, 2014

Cox's 1972 publication Regression Models and Life Tables links logistic regression to an extension of the discrete time proportional hazard model. I do not understand how Equation (21) in the publication is derived. Let $T$ be a discrete random variable...

From: Stats Stack Exchange | By: Helix | Thursday, November 27, 2014

This is my first try at any regression and unfortunately I'm starting with an ordered logit model using the polr function in R. Does polr require all ordered factors with values like a,b,c to be converted into numbers like 1,2,3 instead of a,b,c before...

From: Stats Stack Exchange | By: duke_sastry | Thursday, November 27, 2014

According to a book, a distribution belongs to the exponential family if it can be written in the form $\exp\big(a(y)\,b(\theta) + c(\theta) + d(y)\big)$. I wrote the Bernoulli distribution as $\exp\Big(y \log\,[{\mu}/{(1-\mu)}] + \log\,(1-\mu)\Big)$. In this case $a(y)=y$, $b(\theta)= \log\,[{\mu}/{(1-\mu)}],...

From: Stats Stack Exchange | By: Günal | Thursday, November 27, 2014

I have two groups of data that have different sample sizes, and in order to be able to analyze both sets they must have the same variance. I was told I should use Bartlett's test to check the homogeneity of variance, but when I try to run the test in R it says...

From: Stats Stack Exchange | By: pocketlizard | Thursday, November 27, 2014

Except for the fact that returns can be negative while prices must be positive, is there any other reason behind modelling stock prices as a lognormal distribution but modelling stock returns as a normal distribution?

From: Stats Stack Exchange | By: Victor | Thursday, November 27, 2014

PROBLEM STATEMENT: Let $X$ be a random variable in $m$-dimensional space. The distance between each pair of vectors $x_i^m,x_j^m$ is $D_{i,j}^m =d(x_i^m,x_j^m)$. There is a measure, the correlation sum $C(r)$, which represents the probability of the distance...

From: Stats Stack Exchange | By: Srishti M | Thursday, November 27, 2014

I am trying to find a commonly cited paper by John Tukey published in 1960 called "A survey of sampling from contaminated distributions", from a monograph(?) called "Contributions in Probability and Statistics". I can neither find this article from a...

From: Stats Stack Exchange | By: Matt Brenneman | Thursday, November 27, 2014

Referring to the Baum–Welch algorithm, http://cs.au.dk/~cstorm/courses/MLiB_f14/slides/hidden-markov-models-4.pdf Is this formula correct? I have spent a couple of days trying to figure out which part is wrong. I'm trying to train on many sequences. Each sequence is...

From: Stats Stack Exchange | By: Kanit Srisuthep | Thursday, November 27, 2014

In a binary classification task, I have a small training set (n=900, 9 features). The two groups are not symmetric (1 = 560, 0 = 340). I also have a test set (n=400) where I don't know the class variable. Let's say I want to check if an SVM works fine....

From: Stats Stack Exchange | By: Buzz Lightyear | Thursday, November 27, 2014

For my class project, I am working on the Kaggle competition "Don't Get Kicked". The project is to classify test data as a good or bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices: Since the data is highly...

From: Stats Stack Exchange | By: Jatin Ganhotra | Thursday, November 27, 2014

I've worked out that some physical process has the form $y = ax_1 + (1-a)x_2$, and would like to perform regression to find $a$. I thought about performing multiple regression of $y$ on $x_1$ and $x_2$ and hoping that the coefficients sum to 1, but I guess this isn't...

From: Stats Stack Exchange | By: kezz_smc | Thursday, November 27, 2014
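
For the model $y = ax_1 + (1-a)x_2$ in the question above, the constraint can be built in by rearranging to $y - x_2 = a(x_1 - x_2)$ and fitting a single slope through the origin. The sketch below assumes ordinary least squares with additive noise on $y$; the function name and the example data are hypothetical.

```python
def fit_mixing_weight(y, x1, x2):
    """Least-squares estimate of a in y = a*x1 + (1 - a)*x2."""
    # Rearranged model: (y - x2) = a * (x1 - x2), a regression through the origin.
    d = [u - v for u, v in zip(x1, x2)]
    z = [t - v for t, v in zip(y, x2)]
    denom = sum(di * di for di in d)
    if denom == 0:
        raise ValueError("x1 and x2 are identical, so a is not identifiable")
    return sum(di * zi for di, zi in zip(d, z)) / denom

# Hypothetical noise-free data generated with a = 0.3
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 0.0, 1.0, 5.0]
y = [0.3 * u + 0.7 * v for u, v in zip(x1, x2)]
a_hat = fit_mixing_weight(y, x1, x2)
print(a_hat)
```

Because the constraint is imposed algebraically, the two implied coefficients sum to 1 by construction instead of only approximately.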

I would like to know simple examples where using the MSE as a measure of parameter estimation is bad, and which metrics we can use instead.

From: Stats Stack Exchange | By: Lost1 | Thursday, November 27, 2014

What kind of test would I use to find the answer to this? Is there evidence that cereals promoted in an in-store circular have a higher average number of units sold per store than cereals not promoted in an in-store circular? I was given data from a...

From: Stats Stack Exchange | By: Yvonne Herrera | Thursday, November 27, 2014

I have daily Facebook data (past 3-4 months) of a company. I know how many fans they gained per day, fans they lost per day, engagements and so on. These are split into 'paid for', 'free from the current fan base' and 'free from outside the fan base'....

From: Stats Stack Exchange | By: Dino Abraham | Thursday, November 27, 2014

How do I calculate the 95% CI for the following 11 readings?
102.61
100.31
107.04
95.00
105.61
97.75
107.76
96.56
92.90
96.98
102.03
AVERAGE=100.41
Std. dev.=5.017

From: Stats Stack Exchange | By: Adel Mesfer | Thursday, November 27, 2014
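
For the 11 readings in the question above, a t-based interval for the mean is the usual choice with a sample this small. The sketch below assumes the readings are independent and roughly normally distributed. The critical value 2.228 is the two-sided 95% table value of Student's t with 10 degrees of freedom, hard-coded here to keep the example free of third-party libraries.

```python
from math import sqrt
from statistics import mean, stdev

readings = [102.61, 100.31, 107.04, 95.00, 105.61, 97.75,
            107.76, 96.56, 92.90, 96.98, 102.03]

n = len(readings)
m = mean(readings)   # sample mean
s = stdev(readings)  # sample standard deviation (n - 1 in the denominator)

# Two-sided 95% critical value of Student's t with n - 1 = 10 degrees of freedom.
# This is a table value and is an assumption of the sketch: it must change if n changes.
t_crit = 2.228

half_width = t_crit * s / sqrt(n)
ci = (m - half_width, m + half_width)
print(f"mean = {m:.2f}, sd = {s:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```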

I analyzed experiments with SPSS and obtained the following values. In which case is it possible for Feature 1 to be significant (p<.05) in Group A+B while Group A and Group B each show low significance for Feature 1? I can hardly interpret...

From: Stats Stack Exchange | By: Youngjae | Thursday, November 27, 2014

I have reaction time data from four different age groups, and I am hoping to prove that reaction time improves with age. I know that ANOVA would be preferred to determine if age has an effect, but I am not looking to simply prove that or to compare each...

From: Stats Stack Exchange | By: kdk | Thursday, November 27, 2014

How do I find the regression coefficients if the only data provided is a correlation matrix table? Here is an example of the correlation matrix for x1. (I also have x2, x3, x4, but only provide x1 here for the sake of simplicity.) x1 x2 x3 x4 Pearson Correlation 1 -.519...

From: Stats Stack Exchange | By: user3213703 | Thursday, November 27, 2014

I want to predict more than one dependent variable by running one model. I thought that we could use a multivariate multiple regression model, but I don't know how to do it with Excel or R.
Can anyone give me an example benchmark on this?

From: Stats Stack Exchange | By: Rabin | Thursday, November 27, 2014

The relationship between the standard normal and the chi-squared distributions is well known. I was wondering though, is there a transformation that can lead from a $\chi^2 (1)$ back to a standard normal distribution? It can be easily seen that the square...

From: Stats Stack Exchange | By: JohnK | Wednesday, November 26, 2014

In a random effects model, the composite error is defined as $\varepsilon_{it} = \alpha_i + u_{it}$, where $\alpha_i$ is uncorrelated with $u_{it}$; the $u_{it}$ have constant variance and are serially uncorrelated. Also $e_{it} = \varepsilon_{it} - \lambda \bar{\varepsilon}_i$. Determine $\lambda$ such that cov($e_{it}$,...

From: Stats Stack Exchange | By: Saurav | Thursday, November 27, 2014

A. You should not use the t-test since the population does not have a normal distribution. B. You may not use the t-test. The t-test is robust to non-normality for confidence intervals but not for hypothesis tests. C. You should make a transformation,...

From: Stats Stack Exchange | By: asjkldjaksjd | Thursday, November 27, 2014

Is chi-squared feature selection better than mutual-information-based feature selection?

From: Stats Stack Exchange | By: Ankit Das | Thursday, November 27, 2014

Here I am going to fit a model using quasi-likelihood (because the dispersion parameter > 1; y is binary data). But when I use a variance equal to $\mu(1-\mu)$ or $\mu^2(1-\mu)^2$, the Pearson residuals look like this. Do you have any idea why it...

From: Stats Stack Exchange | By: Su Hua | Thursday, November 27, 2014

In a logistic regression, a positive or negative beta tells you the direction in which that variable works. Is there anything in the Random Forest variable importance measures that indicates the direction of the variable?

From: Stats Stack Exchange | By: Quantitative72 | Thursday, November 27, 2014

I've calculated the posterior distribution of a variable X both analytically and by simulation, but the results don't match. X ~ Normal(mu, s=6), and the prior distribution of X is a Normal(mu=100, s=20). As my likelihood is a normal with known variance, I can calculate...

From: Stats Stack Exchange | By: Mik meadow | Thursday, November 27, 2014
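
The analytical side of the question above can be checked with the standard conjugate update for a normal likelihood with known variance. The sketch below rests on several assumptions the excerpt does not confirm: that `s` denotes a standard deviation, not a variance, that the prior Normal(100, 20) is placed on the mean `mu`, and that the observations are independent. The function name and the data are hypothetical.

```python
from math import sqrt

def normal_posterior(data, prior_mean=100.0, prior_sd=20.0, lik_sd=6.0):
    """Posterior mean and sd of mu for x_i ~ Normal(mu, lik_sd), mu ~ Normal(prior_mean, prior_sd)."""
    n = len(data)
    prior_prec = 1.0 / prior_sd**2   # precision = 1 / variance
    data_prec = n / lik_sd**2        # total precision contributed by n observations
    post_var = 1.0 / (prior_prec + data_prec)
    # Precision-weighted combination of the prior mean and the data.
    post_mean = post_var * (prior_mean * prior_prec + sum(data) / lik_sd**2)
    return post_mean, sqrt(post_var)

# Hypothetical observations
data = [110.0, 105.0, 98.0]
post_mean, post_sd = normal_posterior(data)
print(post_mean, post_sd)
```

When an analytical posterior and a simulated one disagree, one thing worth checking is whether a variance was passed where a standard deviation was expected, or the reverse.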

This is a fairly simple question but I can't figure out which one is the correct approach. In astronomy it is usual to report age values via their base 10 logarithms instead of the actual value. So a star that is $10^8$ yrs old is said to have an age...

From: Stats Stack Exchange | By: Gabriel | Thursday, November 27, 2014

I am learning about K-means algorithm, and I have generated a dataset with 150000 data points, with 10000 points per cluster. (Scatter plot at the bottom) When I run K-means on the dataset, I first randomly pick 15 data points as initial centers, and...

From: Stats Stack Exchange | By: Bill Liu | Thursday, November 27, 2014

How can I fit reduced-rank regression with continuous response in R?
I found the package VGAM but it only fits for discrete distributions...

From: Stats Stack Exchange | By: Daniel Falbel | Wednesday, November 26, 2014

I was considering using natural cubic splines for my prediction problem when I had a thought: In Ridge Regression, you set out to minimize the expression \begin{equation} F(X)=\lambda\sum_j b_j^2 + \sum_i (b^T x_i - y_i)^2. \end{equation} Where the first...

From: Stats Stack Exchange | By: Stefan Tian | Wednesday, November 26, 2014

Let x = [1 0.9 0.1 -0.3]'. The formula for autocorrelation is E[x*x'], which should give a vector of mean values. The documentation in Matlab says that if a vector is input then the result is a scalar. I am confused about the implementation of this formula...

From: Stats Stack Exchange | By: SKM | Wednesday, November 26, 2014

