# Stats Stack Exchange

I am building a model and I think that geographic location is likely to be very good at predicting my target variable. I have the zip code of each of my users. I am not entirely sure about the best way to include zip code as a predictor feature in my...
From: Stats Stack Exchange | By: ansonw | Wednesday, April 23, 2014
I am trying to find the area with thickest density. I have newborn numbers by day for which I am trying to find the series of 7 consecutive days that would produce the highest frequency. So the start and end dates are not tied to any particular day of...
From: Stats Stack Exchange | By: user3507767 | Thursday, April 24, 2014
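One way to approach the seven-day question above is a rolling sum over the daily counts. The following is only a minimal sketch, assuming the counts are stored in a plain list ordered by day with no missing days. The function name `best_window` and the sample numbers are made up for illustration and do not come from the question.

```python
def best_window(counts, width=7):
    """Return (start_index, total) for the run of `width` consecutive days
    with the largest sum. Ties go to the earliest window."""
    if len(counts) < width:
        raise ValueError("need at least `width` days of data")
    # Sum of the first window, then slide one day at a time.
    total = sum(counts[:width])
    best_start, best_total = 0, total
    for start in range(1, len(counts) - width + 1):
        # Add the day entering the window, drop the day leaving it.
        total += counts[start + width - 1] - counts[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical daily newborn counts, one entry per day.
daily_births = [3, 5, 2, 8, 9, 7, 6, 4, 10, 1, 2, 3]
start, total = best_window(daily_births)
print(start, total)
```

Because the window slides one day at a time, the start and end are not tied to any particular weekday, which seems to be what the question asks for.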
I've seen answers discussing the complexity of training SVMs and neural nets, but how about for predicting new responses once a model has been trained? For context, I'm working on an app that should produce predictions in near real-time given incoming...
From: Stats Stack Exchange | By: user1956609 | Monday, April 21, 2014
(b) The fitted coefficients and summary table including t-statistics and the overall F statistic for the fit etc. (c) Appropriate diagnostic plots. (d) Description of any action taken as a result of the diagnostic plots. (e) A discussion of the results....
From: Stats Stack Exchange | By: ash | Thursday, April 24, 2014
I'd like to use a regression discontinuity design to evaluate a program where the discontinuity/assignment to treatment occurs at the group level. However, I'd like to measure the outcome at the individual level (as opposed to measuring the outcome as...
From: Stats Stack Exchange | By: programevaluator | Wednesday, April 23, 2014
Suppose we can choose from two different catalysers. 10 observations are taken from the first one and 12 from the other one. If $s_1 = 14$ and $s_2 = 28$, can we reject at $\alpha = 5\%$ the hypothesis that the variances are equal? Here is what the teacher...
From: Stats Stack Exchange | By: Dave | Wednesday, April 23, 2014
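The test statistic for the variance question above can be worked out directly. This is a sketch only, assuming that $s_1$ and $s_2$ are sample standard deviations (the excerpt does not say) and that the larger sample variance goes in the numerator. The helper name `variance_ratio` is hypothetical.

```python
def variance_ratio(s_a, n_a, s_b, n_b):
    """F statistic with the larger sample variance on top,
    together with its (numerator, denominator) degrees of freedom."""
    var_a, var_b = s_a ** 2, s_b ** 2
    if var_a >= var_b:
        return var_a / var_b, (n_a - 1, n_b - 1)
    return var_b / var_a, (n_b - 1, n_a - 1)


# Values from the question: n1 = 10, s1 = 14 and n2 = 12, s2 = 28.
f_stat, dof = variance_ratio(14, 10, 28, 12)
print(f_stat, dof)
```

For a two-sided test at $\alpha = 5\%$, the statistic would then be compared with the upper 2.5% critical value of the $F$ distribution with those degrees of freedom, taken from a table or, if SciPy is available, from `scipy.stats.f.ppf(0.975, 11, 9)`.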
I'm optimizing 5 parameters for an option pricing model. Now I want to assess whether these parameters are stable over time (i.e., a year). For this I create about 12 subsamples and estimate the optimal parameters for each subsample (I do this by minimizing...
From: Stats Stack Exchange | By: jcfrei | Thursday, April 24, 2014
Let $X$ be an $m\times n$ ($m$: number of records, and $n$: number of attributes) dataset. When the number of attributes $n$ is large and the dataset $X$ is noisy, classification gets more complicated and the classification accuracy decreases. One way...
From: Stats Stack Exchange | By: user1468089 | Thursday, April 24, 2014
I have the following table: From here I want to answer the following question: P(CHILD CAN RECOG BEG SOUNDS OF WORDS – YES | CHILD RECOGNIZES LETTERS - Most of them). From my understanding of this I am saying 81.1%, am I reading this correctly?...
From: Stats Stack Exchange | By: MCP_infiltrator | Thursday, April 24, 2014
I successfully find the standard error of difference, but can't get the t value right. According to this calculator, I am supposed to get a sd of 0.092, which I do. But for some reason, when I do $t = \bar{d} / (sd / \sqrt{11})$ I get 11 instead of 3.xxx...
From: Stats Stack Exchange | By: Dave | Thursday, April 24, 2014
I have a dataset in which I have measured a dependent variable (let's call it $Y$) along with several independent variables $(X_1, X_2, X_3)$. The independent variables are correlated with one another to some extent. I would like to understand how $Y$...
From: Stats Stack Exchange | By: Italianice | Thursday, April 24, 2014
Here is perhaps a simple question. Can a beta distribution approach a step function on $[0,1]$, in say $L_p$ norm, $p\ge 1$?
From: Stats Stack Exchange | By: Hans | Thursday, April 24, 2014
I am building an item-based collaborative filter recommendation system. I have a matrix of users and items, which in this case, are products that were either bought or not (i.e., binary: 1 or 0). I can build an item-based CF out of this co-occurrence/association...
From: Stats Stack Exchange | By: ansonw | Wednesday, April 23, 2014
In the linear SVM model, one may have the following equation to describe how to achieve a maximal margin while still classifying the data into 2 groups: $$L(w, \epsilon) = w\cdot w + \lambda\sum\limits_{i=1}^R \epsilon_i$$...
From: Stats Stack Exchange | By: CodeKingPlusPlus | Wednesday, April 23, 2014
Below I'm using a negative binomial because it is more flexible than a simple Poisson model. The data are counts $y$ of events for 16 individuals $x$. There are 14 counts (i.e. counting periods) for each individual. The likelihood function is dnegbin($p_x...
From: Stats Stack Exchange | By: user12719 | Wednesday, April 23, 2014
I am trying to model the time until some event occurs for individuals observed over a 24 month period. For about 75% of people, no event occurs. For 15% of people, we know exact time of the event. For the other 10%, we only know a time window in which...
From: Stats Stack Exchange | By: Homina | Wednesday, April 23, 2014
How do we choose the starting value in R to estimate the parameter by goodness of fit or MLE?
From: Stats Stack Exchange | By: farrukh jamal | Wednesday, April 23, 2014
I am not a statistician or mathematician but am trying to learn. My question: In Bayes Theorem, $p(C|X)=p(X|C)p(C)/p(X)$, what are the English terms for $p(X|C)$ and $p(C)/p(X)$? In other words, is $p(C)$ the prior probability? What is the comparable...
From: Stats Stack Exchange | By: user6092 | Wednesday, April 23, 2014
I have got two questions on an agricultural field trial that was conducted at two sites in two consecutive years. Virtually everything was the same in all trials (crop variety, planting density...). The data permits that all data can be analysed in a...
From: Stats Stack Exchange | By: mcg | Wednesday, April 23, 2014
Given this information about 3 groups in an ANOVA test how would I get the SStotal, SSBetween, and SSwithin? Group 1 (5 people): Mean = 4.4 SD = 1.67 Var = 2.7889 Group 2 (4 people): Mean = 4.75 SD = 0.96 Var = 0.9216 Group3 (5 people): Mean = 4.6 SD...
From: Stats Stack Exchange | By: user109444 | Wednesday, April 23, 2014
I have got two questions on an agricultural field trial that was conducted at two sites in two consecutive years. Virtually everything was the same in all trials (variety, spacing, planting density...). The data permits that all data can be analysed...
From: Stats Stack Exchange | By: mcg | Wednesday, April 23, 2014
I am interested in approximating the mean $E[f(X)]$ where $X$ is a random variable with mean $\mu$, variance $\sigma^2$ and support $\{0,...,n\}$ where $$f(X):=\sum_{i=X}^n \beta(i)$$ with $$\beta(i):={n \choose i} p^i (1-p)^{n-i}$$ I've stated this...
From: Stats Stack Exchange | By: user136457 | Wednesday, April 23, 2014
Suppose I have a survival model like this: set.seed(123) require(survival) df<-data.frame(time=as.integer(rnorm(100,50,5)), status=rbinom(100,1,0.7), age=rnorm(100,60, 5), gender=rbinom(100,1,0.5)) df$time=50,50,df$time) fit<-survreg(Surv(time,...
From: Stats Stack Exchange | By: David Z | Wednesday, April 23, 2014
I'm reading through Fan's SCAD paper and I feel like I'm not getting a simple step. On page 1354 where he is talking about quadratic approximations to a penalty function he has $$\left[\rho_\lambda(|\beta_j|)\right]' = \rho'_\lambda(|\beta_j|)\text{sgn}(\beta_j)...
From: Stats Stack Exchange | By: Benjamin | Wednesday, April 23, 2014
I understand AR(p) model, its input is the time series being modelled. I'm completely stuck when reading about MA(q) model: its input is innovation or random shock as it's often formulated. The problem is I can't imagine how to get innovation component...
From: Stats Stack Exchange | By: werediver | Wednesday, April 23, 2014
I'm sure this is a very straightforward question but it came up in my work today and I could not think of the reasoning behind it. I had two sets of numeric values (A & B) and was looking at the median of their ratio and noticed that median(A)/median(B)...
From: Stats Stack Exchange | By: Steve Reno | Tuesday, April 22, 2014
I have a binary classification problem and am using a neural network and SVM for it. So I choose a threshold (for instance 0.5) for the output of the neural network. If the output is greater than 0.5 it belongs to class 1 and if it is smaller than 0.5 it belongs to class 2....
From: Stats Stack Exchange | By: user2991243 | Wednesday, April 23, 2014
I have a huge matrix (10*10k). I'd like to know if there is a way to find similarities between lines. Let's give an example of matrix: 4*5 col1 col2 col3 col4 0 0 1 0 2 3 4 5 2 3 2 3 0 0.1 1 0 0 0 1 0 I'd like to know if there is a statistical theory...
From: Stats Stack Exchange | By: user3378649 | Monday, April 21, 2014
I am testing for Breusch-Pagan in time series data. I have regressed the residual on the independent variables and added a lag for the dependent variable; is this the right way to be going about testing Breusch-Pagan in time series data?...
From: Stats Stack Exchange | By: kaye | Wednesday, April 23, 2014
For my research on mental health problems and correlates, the child-parent relationship has been identified as a correlate. The child-parent relationship is planned to be measured with 6 questions, and each question has a five-point scale of answers. Example- question...
From: Stats Stack Exchange | By: user44311 | Wednesday, April 23, 2014
I'm coding an app, a part of which is graphing values from a database. The graph plots the average value of every 10% of the values up to 100% so there are ten points along the x-axis. The graph shows a trend of the lifetime of the stats. Hopefully this...
From: Stats Stack Exchange | By: OliverJ90 | Wednesday, April 23, 2014
I like to tinker in my spare time with clustering algorithms. Over the past few days I was attempting to tinker with a clustering algorithm using density fields of the data. I tried several variations and I was surprised that my algorithms were reasonably...
From: Stats Stack Exchange | By: user1172468 | Monday, April 21, 2014
Similar questions have been asked but I have not managed to get a conclusion from them. I am comparing two sets of samples, where ratios have been obtained for several analytes per sample. So the values are restricted to be in the [0,1] interval. One set...
From: Stats Stack Exchange | By: David | Wednesday, April 23, 2014
I'm trying to cluster meaningfully a set of objects characterized by a vector space (bag-of-words) model. Each of those 5000 objects has 1-8 features ("words") from a set of 5500 possible. I used a vector space model (A_i = 1 if feature i is present)...
From: Stats Stack Exchange | By: user44212 | Monday, April 21, 2014
I've got some problems with selecting the right model/method in my analysis. Two groups of animals (differing by "Treatment") were measured from the 1st to the 42nd day; one common value for each group was measured each day (food consumption per day). I need to...
From: Stats Stack Exchange | By: Andrey Myslivets | Tuesday, April 22, 2014
$X$ follows $Po(80)$. I used a normal approximation to get $P(55\leq X\leq 75)$ which I got correct. I need to find $P(X=80)$. I tried the Poisson distribution directly; however, my calculator shows an error when calculating $80^{80}$. The correct...
From: Stats Stack Exchange | By: metric | Wednesday, April 23, 2014
I have a clarifying question. There is a normality assumption when it comes to consider OLS models and that is that the errors be normally distributed. I have been browsing through Cross Validated and it sounds like Y and X don't have to be normal in-order...
From: Stats Stack Exchange | By: user44278 | Tuesday, April 22, 2014
If the answer is "it depends", what does it depend on? Does convergence depend on the ratio of predictor variables to sample size, or the size of $R^2$, or something else? I am mainly interested in CIs on Unadjusted $R^2$, but would also be interested...
From: Stats Stack Exchange | By: user1205901 | Wednesday, April 23, 2014
Hi all, I am an undergraduate student who is currently doing an assignment. I am now facing a few problems which are:- 1) Age is usually a positive return to wage, but in my regression output, that's not the case. How should I interpret the signs and...
From: Stats Stack Exchange | By: Lee | Wednesday, April 23, 2014
Package RHmm (R) I have a vector which I fit into a hmm model in an attempt to select an optimal number of states for a hidden markov model x<-c(-0.0961421466,-0.0375458485,0.0681121271,0.0259201028,0.0016780785,0.0311860542, 0.0067940299,0.0126520055,0.0357599812,0.0007679569,0.0409759326,0.0560839083,-0.0272581160,-0.0439501404,0.0321578353,0.0196158110,-0.0097262133,-0.0226182376,0.0119897380,-0.0099522863,-0.0359443106,-0.0039363349,-0.0476283592,-0.0383203835,-0.0518624079,0.0187455678,0.0950535435,0.0057115192,-0.0307805051,-0.0272725295,-0.0254645538,-0.0102565781,-0.0267986024,-0.0482906267,-0.0256826510,-0.0414746754,-0.0470666997,0.0284912760,0.1021992517,0.0875572274,0.0064152031,0.0200731787,-0.0091688456,-0.0575608699,-0.0442028942,-0.0277449185,-0.0115369429,0.0084710328,0.0745290085,0.0159369842,-0.0784550401,-0.0934970644,-0.0978390888,0.0160188869,0.0275268626,-0.0552651617,0.0033928140,0.0468507896,0.0374087653,0.0521167410,-0.0177752833,-0.0592673076,0.0514406681,0.0847486437,0.0738066194,-0.0098354049,-0.0572274292,0.0478305465,0.0096885221,-0.0445535022,-0.0153455265,-0.0105375508,0.0100704249,-0.0035215994,0.0243363762,0.0504443519,0.0570023276,0.0395103033,-0.0612817210,-0.0557737453,-0.0273657697,-0.0220077940,0.0083501817,0.0275081574,0.0323161331,0.0385741087,0.0175820844-0.0410599399,-0.0071019642,0.0431060115,-0.0107360128,-0.0007280372,0.0360799385,-0.0061620858...
From: Stats Stack Exchange | By: Barnaby | Wednesday, April 23, 2014
Based on the following relationship between Matthew's Correlation Coefficient (MCC) and Chi Square: (MCC is the Pearson product-moment Correlation Coefficient) Is it possible to conclude that: By having: Imbalanced Binary Classification Problem, N =...
From: Stats Stack Exchange | By: Hamed | Monday, April 21, 2014
I have data $\mathbf{X}$ that can be expressed as a sequence of i.i.d. vectors $[\mathbf{x}_1,\ldots,\mathbf{x}_n]$, each of length $m$. That is, each vector $\mathbf{x}_i$ is drawn independently from a known distribution family $\mathcal{D}^m_{\theta}$,...
From: Stats Stack Exchange | By: M.B.M. | Wednesday, April 23, 2014
I am looking at the Wikipedia entry for empirical Bayes, but it's a bit confusing - it seems to me the solution must apply only to the case in which there's only $n=1$ sample $y$ for each $\theta$ and the "sample mean" that's referred to is really just...
From: Stats Stack Exchange | By: user44285 | Wednesday, April 23, 2014
I'm trying to compare my model results with an experiment. I want to overlap the model curve with experiment curve from a paper. What are my options for digitizing the graph from an image in the pdf paper to excel data?
From: Stats Stack Exchange | By: user40 | Tuesday, April 22, 2014
I have been fitting the following mixed normal distribution to my data in R and have come up with the following density plot and ecdf (with fitted distribution overlaid in red). The equation of my theoretical distribution is (approximately) $$0.15N(mean=16.37,sd=0.2164)+0.85N(mean=17.84,sd=0.6303)$$...
From: Stats Stack Exchange | By: user135784 | Tuesday, April 22, 2014
Why is the de facto standard in statistics to minimize the sum of squared errors cost function instead of maximizing some reward function like the likelihood function?
From: Stats Stack Exchange | By: blast00 | Tuesday, April 22, 2014
I am using Logistic Regression in a low event rate situation. Overall universe: 46,000 Events: 420 Conventional logistic regression models divide the data into training and test sets and compute the error rates. The final coefficients and threshold levels...
From: Stats Stack Exchange | By: Maddy | Tuesday, April 22, 2014
Are there ways to estimate the finite sample bias with instrumental variables? I guess this would be conditional on assuming some structure to the problem and also would involve simulation, but, at least in applied econometrics papers, I have never seen...
From: Stats Stack Exchange | By: carloscinelli | Tuesday, April 22, 2014
I have a dataset containing two columns and a total of 90 rows. The data is from my experiment where in the first column I have an integer representing the quantity, in the second column I have a percentage. A small example: Quantity Percentage 1 53%...
From: Stats Stack Exchange | By: GreatEyes | Tuesday, April 22, 2014
I have a big dataset and I want to build a classification model (svm, rf, ann etc.). Then I split the original dataset into training set and test set. I build the model using training set. After it was done, I use the model to predict the test set. Here,...
From: Stats Stack Exchange | By: Kevin | Tuesday, April 22, 2014
