# Stats Stack Exchange

I'm about to get my PhD. My background is in engineering and applied mathematics. My research is in statistical & machine learning methods for biological applications (without being too specific). I love mathematics and doing research, but I'm astoundingly...
From: Stats Stack Exchange | By: guestposter | Monday, September 1, 2014
I'm running some multiple group CFA models comparing covariance structure by race/ethnicity and have survey data from 6th, 8th, 10th and 12th graders. My supervisor has told me to combine 6th and 8th grade to run a middle school only model and to combine...
From: Stats Stack Exchange | By: Chris Cambron | Monday, September 1, 2014
I measured two quantities $a$ and $b$ for three large groups (with different size $n$), normally distributed: $a\pm\sigma_a$ and $b\pm\sigma_b$ (the $\sigma$'s are different and are measured uncertainty). Then I quantified the product $a\times b$ with...
From: Stats Stack Exchange | By: Terenz | Tuesday, September 2, 2014
I am working on a project where I need to chart statistical data and related, skewed distributions a la http://en.wikipedia.org/wiki/Skew_normal_distribution. Unlike with normal distributions, in these charts, when there is skew, neither the mean nor...
From: Stats Stack Exchange | By: mkoistinen | Tuesday, September 2, 2014
I'm analyzing a data set that has an extremely long tail, and I'm looking for a way to transform the data into a bell curve so I can apply statistical analysis to it. I hope this makes sense; any help would be greatly appreciated. Edit: My data is the number...
From: Stats Stack Exchange | By: user3251849 | Tuesday, September 2, 2014
I just read about the deviance measure for the logistic regression. However, the part that is called saturated model is not clear to me. I did an extensive Google search but none of the results answered my question. So far I found out that a saturated...
From: Stats Stack Exchange | By: toom | Tuesday, September 2, 2014
I am a dummy in statistics. Can anyone explain the basics of univariate vs multivariate Cox proportional-hazards models? What is the difference between them? How are they performed in the SPSS package? What is the difference between univariate Cox proportional-hazards...
From: Stats Stack Exchange | By: Malvin Zamani | Tuesday, September 2, 2014
I have fitted a binary classification gbm model, and one of the predictor variables, Affiliate has 50 different levels. Given the following readout from gbm.perf: var Importance Affiliate 52.939994 ProgramName 25.765384 Distributor 17.502216 MarketingCategoryName...
From: Stats Stack Exchange | By: Mike | Tuesday, September 2, 2014
I don't have a specific mathematical setting for this question. It's rather high level. If you have a bunch of data X for a variable that we want to predict Y, and all we care about is predicting within the range of observed values, is there utility...
From: Stats Stack Exchange | By: Majid alDosari | Tuesday, September 2, 2014
I was reading the textbook [Probability Models for DNA Sequence Evolution][1] by Durrett. In chapter 1, he discusses the Wright Fisher model and the coalescent theory which I am interested in. He defines heterozygosity as the probability that two copies...
From: Stats Stack Exchange | By: Govinda Kamath | Tuesday, September 2, 2014
The alligators example from the OpenBUGS examples repository is the same example that comes with WinBUGS. Basically this is a multinomial logistic regression example in which the outcome variable has 5 possible values and there are two categorical independent...
From: Stats Stack Exchange | By: sarikan | Tuesday, September 2, 2014
Are there any packages that implement the Autoclass/ Naive Bayes Clustering algorithm in R or Python? Alternatively, what are some other clustering algorithms that can handle both categorical and numeric variables that are implemented in either R or...
From: Stats Stack Exchange | By: artdv | Tuesday, September 2, 2014
Trying to build a predictive model for attrition prediction at a service desk/call center. Have daily data on the following parameters: 1. Call quality (0-100%), 2. Avg. handling time (in minutes), 3. Attendance, 4. Customer feedback (1/0) for both agents who...
From: Stats Stack Exchange | By: Vinay Tiwari | Tuesday, September 2, 2014
I am trying to analyze multiple variables of our student population as they relate to assessment outcomes. I have the assessment change score and demographic information (race, sex, age, program, # of programs, time of service, # of services). I...
From: Stats Stack Exchange | By: Joe | Tuesday, September 2, 2014
I'm using the K-fold cross-validation technique to generate train, test and validation indexes for a neural network. My sample size is 230~700. What is the best K for cross-validation here? I'm currently using 10-fold cross-validation, but I think it is too high. What...
From: Stats Stack Exchange | By: user2991243 | Tuesday, September 2, 2014
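For readers unfamiliar with the procedure named in the excerpt above, here is a minimal sketch of how K-fold train/test indexes can be generated. It does not address the asker's question of which K is best. The function names `k_fold_indices` and `train_test_splits`, the fixed seed, and the choice of 230 samples with K = 10 (taken from the lower end of the stated sample size) are illustrative assumptions, not the asker's code.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=0):
    """Yield one (train, test) pair of index lists per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        # Training indexes are all folds except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical setting: 230 samples, 10 folds.
splits = list(train_test_splits(n_samples=230, k=10))
```

Each sample index appears in exactly one test fold, so smaller K means larger test folds and fewer model fits; a separate validation set can be carved out of each training list in the same way.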
I think ordinary least squares is the sum of the vertical distances between the observed data and the model line (regression). And the residual is calculated by adding up all the vertical distances between each observed data point and the model line (regression), not...
From: Stats Stack Exchange | By: Jun Shi | Tuesday, September 2, 2014
I am very new to statistics and econometrics. I am a programmer, but I still need to do one task and become familiar with statistics/econometrics in more detail. 1) I have an index that consists of some factors. In total I have more than 30 factors....
From: Stats Stack Exchange | By: renathy | Tuesday, September 2, 2014
I have some data that I strongly suspect are lognormally distributed, and I'd like to summarize the distribution using the mean and standard deviation. I've read that with lognormal distributions the geometric mean and standard deviation should be used,...
From: Stats Stack Exchange | By: DLaw | Tuesday, September 2, 2014
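As a small illustration of the two quantities mentioned in the excerpt above, the sketch below computes a geometric mean and geometric standard deviation by averaging on the log scale and exponentiating back. The function name `geometric_summary` and the three input values are made up for illustration; the sample (n - 1) variance on the log scale is one common convention, assumed here.

```python
import math


def geometric_summary(values):
    """Return (geometric mean, geometric standard deviation) of positive values."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mean_log = sum(logs) / n
    # Sample variance of the logs (n - 1 denominator, an assumed convention).
    var_log = sum((x - mean_log) ** 2 for x in logs) / (n - 1)
    return math.exp(mean_log), math.exp(math.sqrt(var_log))


# Made-up values, evenly spaced on the log scale.
gm, gsd = geometric_summary([1.0, 10.0, 100.0])
```

The geometric standard deviation is a multiplicative factor rather than an additive spread, so a typical range is written as the geometric mean divided and multiplied by it.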
I have the following data: > dd x y 1 12.810520 0.001382742 2 15.483217 0.001243213 3 11.512925 0.001554218 4 12.130627 0.001458508 5 12.508674 0.001520110 6 8.616858 0.001934162 7 16.528807 0.001330485 8 12.499514 0.001562247 9 16.979896 0.001197619...
From: Stats Stack Exchange | By: KatyB | Tuesday, September 2, 2014
I have an interesting question that I think has not been asked here yet. I am building an AI whose goal is to predict how wrong a standard history-based model is. This is done based on Natural Language Processing (NLP), so from an external source...
From: Stats Stack Exchange | By: ovanwijk | Tuesday, September 2, 2014
If the following generalized linear model was used, how should I interpret the error term? link function: natural log distribution: Gamma distribution i.e., $\ln E(Y)=X\beta$ and $E(Y)=\exp(X\beta)$ It seems that the error term should be additive: $Y=\exp(X\beta)+\epsilon$...
From: Stats Stack Exchange | By: Eddy Chen | Tuesday, September 2, 2014
Let $a \sim\mathcal{N}(6.532056,0.06532056)$, $b \sim\mathcal{N}(8.390961,0.08390961)$ and $c \sim\mathcal{N}(8.736566,0.08736566)$. We use $\mathcal{N}(\mu,\sigma^2)$ notation unless specified otherwise. We construct two normal variables $x = a-b$ and...
From: Stats Stack Exchange | By: Sandipan Bhattacharyya | Tuesday, September 2, 2014
I do unsupervised clustering for a dataset using the k-means algorithm. I want to know the difference between the different distance measures (Euclidean, cityblock, cosine, correlation, etc.). I tried each of them and each gives me different outputs....
From: Stats Stack Exchange | By: Abbas | Tuesday, September 2, 2014
I am currently trying to analyse data from an experiment of mine and I have done some searching for instructions on the usage of the lme() function for R, since I am looking to analyse my data with a linear mixed effects approach. However, in case lme()...
From: Stats Stack Exchange | By: bunsenbaer | Tuesday, September 2, 2014
I am working with a survey dataset which contains hundreds of variables. The item-missing data rate ranges from 0.2% to 10%. In order to retain study units with missing values and to maintain a reasonable statistical power for my analyses, I attempted...
From: Stats Stack Exchange | By: Ayalew A. | Tuesday, September 2, 2014
I would like to evaluate the goodness-of-fit of the following (Pareto-like) distribution: $$f(r) = \sigma \centerdot r^{-\rho}$$ The function estimates the population of cities given the rank $r$ in a popularity ranking. I have not estimated the parameters...
From: Stats Stack Exchange | By: Tom | Tuesday, September 2, 2014
Various hypothesis tests, such as the $\chi^{2}$ GOF test, Kolmogorov-Smirnov, Anderson-Darling, etc., follow this basic format: $H_0$: The data follow the given distribution. $H_1$: The data do not follow the given distribution. Typically, one assesses...
From: Stats Stack Exchange | By: Clarinetist | Tuesday, September 2, 2014
I want to ask about the process for classifying tweet data. I am working with Twitter data, but I am confused about how to classify the tweets using the Mallet tool. Example: I have 200,000 tweets. The tweets are about the Apple company. I want to classify...
From: Stats Stack Exchange | By: user46835 | Tuesday, September 2, 2014
In a book, I've found SPSS syntax for performing an aligned rank transformation in a two way ANOVA. The procedure is: - save residuals by performing a standard ANOVA - use Aggregate to determine effects for group means (mij for interaction, ai as first...
From: Stats Stack Exchange | By: user54643 | Tuesday, September 2, 2014
I am trying to fit an ARMAX with two exogenous time series with the following code but it gives me a "computationally singular" error. I know it is about defining more than 2 time series for xreg because when I include only one exogenous series it works! library(forecast)...
From: Stats Stack Exchange | By: Fred | Tuesday, September 2, 2014
I have a group of correlation coefficients (more than two). They are all dependent on one variable A in the form of r_A1, r_A2, r_A3....r_Ak, where 1, 2 ...k denote other variables; they all have the same sample size. My question is: what statistics...
From: Stats Stack Exchange | By: ksna | Monday, September 1, 2014
I'm about to start a project in R, and my goal is to cluster soccer news by event. In other words, each cluster must contain all news about a specific event from multiple sources. So, my questions are: Should I use hierarchical clustering (like Diana or...
From: Stats Stack Exchange | By: andrealmeida | Monday, September 1, 2014
I am trying to use the "car" command in the "cts" package in R, and I see the "scale" parameter there. I wonder whether this can be assumed to be equivalent to time intervals for time series forecasting. For example, the code is like this: car(x,...
From: Stats Stack Exchange | By: user3319993 | Monday, September 1, 2014
Is there some restriction to parameters $( \alpha , \beta)$ that make the beta distribution always bell-shaped? Thanks
From: Stats Stack Exchange | By: micheal | Monday, September 1, 2014
Context: The variance of a sum of independent random walks is a sum of their variances: $\sigma^2 = \sigma_1^2 + \sigma_2^2$. In the case of dependent random walks with a bivariate normal distribution it will be $\sigma^2 = \sigma_1^2 + \sigma_2^2 + 2 \rho...
From: Stats Stack Exchange | By: mezhaka | Monday, September 1, 2014
I have a dataset with a number of identical twin pairs and fraternal twin pairs. I want to examine the relationship between two variables (let's call them INDEPENDENT and DEPENDENT). However, I can't run a normal OLS regression because each twin's dependent...
From: Stats Stack Exchange | By: SASNewb | Monday, September 1, 2014
Having the Netflix challenge in mind: collaborative filtering is typically presented as a matrix dimension reduction. My question is how does the problem relate to classical regression (supervised learning) problems? Can it be seen as an instance of...
From: Stats Stack Exchange | By: JohnRos | Monday, September 1, 2014
I have a dependent variable, which takes the value 0, 1, 2, or 3. I asked participants to choose three items and coded each as 1 if it is in a certain category and 0 otherwise. I add the three binary variables to form my dependent variable. So 3 is the maximum...
From: Stats Stack Exchange | By: somato | Monday, September 1, 2014
I'm using R. After getting an error asking me to provide starting values for a glm (poisson family), I took a look at my data and realized I had quite a few zeroes. So, I tried zeroinfl from pscl. I got the "computationally singular" error, so I tried...
From: Stats Stack Exchange | By: Bryan | Monday, September 1, 2014
So I am having trouble understanding what this problem is asking. (I am not asking someone to solve it but simply help interpret what the problem wants me to do.) The problem goes like this: "Let A, B, and C be three events. Show that exactly two of...
From: Stats Stack Exchange | By: user2544603 | Monday, September 1, 2014
I am currently working on a modification of a clustering algorithm to suit my problem domain. I want to know which methods are available for me to compare the centroids generated from the two methods. That is, I want to know how well my (modified) clustering...
From: Stats Stack Exchange | By: Kosala | Monday, September 1, 2014
I am applying a Bayesian classifier and would like to find out the F1 score. I determined the TP, TN, FP, FN. Unfortunately I found that in my cross-validation, in almost all test scenarios, only TP and FP are non-zero (so TN=0, FN=0). Probably...
From: Stats Stack Exchange | By: Wildkeiler | Monday, September 1, 2014
I have a small pilot dataset containing experimental measurements for 12 samples, 4 numeric predictors and 1 numeric outcome variable. The goal is to obtain a first rough estimate on the extent to which the outcome variable can be predicted using one...
From: Stats Stack Exchange | By: Rainer | Monday, September 1, 2014
Say I have a dataset with several continuous and categorical variables, and I want to identify what variables (values or properties of these variables) cause one of the continuous variables to increase. How can I model this problem? Several approaches...
From: Stats Stack Exchange | By: user023472 | Monday, September 1, 2014
This question is based on the example 11.1 out of the book Applied linear statistical models of Kutner, Nachtsheim, Neter and Li. You can find the data here. First they calculate a simple linear model: Call: lm(formula = Bloodpressure ~ Age, data = data)...
From: Stats Stack Exchange | By: Kasper | Monday, September 1, 2014
I have a generic question about whether it might sometimes make sense to fix specific regression coefficients to predetermined values. And if this makes sense in particular cases, how do you best go about it? In my case, I have about 1,600 observations...
From: Stats Stack Exchange | By: simon_icl | Monday, September 1, 2014
I am trying to use the createFolds function for k-fold CV in R. But I came across this warning: Warning messages: 1: In foldVector[which(y == dimnames(numInClass)$y[i])] <- sample(seqVector) : number of items to replace is not a multiple of replacement...
From: Stats Stack Exchange | By: Fatma Ezgi Can | Monday, September 1, 2014
My data was collected using the Randomized Response Technique, so I have additional variability in the data. I need to use a GLM (logistic regression) but I should customize a logit link function to incorporate the known probabilities of Randomized Response...
From: Stats Stack Exchange | By: Luciana | Monday, September 1, 2014
I am running a Panel SVAR on country data with LR restrictions using MATLAB. I have 15 countries with 77 observations each. All my variables are demeaned. The structural VAR is constructed as follows: Based on my estimates I construct impulse response...
From: Stats Stack Exchange | By: Lukas | Sunday, August 31, 2014
Can someone help me understand the right scale of measurement to use in the sign test as well as in the signed rank test? I read in a certain book that the nominal scale is used in the sign test while the ordinal scale is used in the signed rank test; the other online material...
From: Stats Stack Exchange | By: Trecy Johnson M | Monday, September 1, 2014
