# Stats Stack Exchange

Given a machine learning model built on top of scikit-learn, how can I classify new instances but then choose only those with the highest confidence? How do we define confidence in machine learning and how to generate it (if not generated automatically...
From: Stats Stack Exchange | By: user2295350 | Wednesday, August 27, 2014
I want to estimate the effect of a randomly assigned intervention. The outcome is measured at the individual level, but the individuals are assigned to groups which influence each other a lot, and it is the groups which are assigned to treatment or control....
From: Stats Stack Exchange | By: SE_groupie | Friday, August 29, 2014
I am interested in supervised pattern recognition problems where the label associated with each pattern gives the probability of membership for each of the $c$ classes, rather than assigning each pattern unequivocally to a single class. This can...
From: Stats Stack Exchange | By: Dikran Marsupial | Tuesday, August 26, 2014
I am looking into making a regression of a bunch of data that is contained on some range of real numbers. In my case, x is between 0 and 1 and y is between 0 and 10. If I have 150 data points on this plane, and want to best model the original data with...
From: Stats Stack Exchange | By: Vaishak Id | Thursday, August 28, 2014
May I ask if it is plausible to have a positive coefficient with a negative marginal/impact effect after running multinomial logit model please?
From: Stats Stack Exchange | By: Tano | Friday, August 29, 2014
Let's say you have X coins, each with a differing probability of landing heads (e.g. coin 1 has 10% landing heads, coin 2 has 20% heads, etc.). Now, let's say that you flip each coin Y times (each coin has a differing amount of flips). We know how many...
From: Stats Stack Exchange | By: Jacob Kranz | Thursday, August 28, 2014
I have this linear model: fit = glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare, data=passengers, family=binomial) summary(fit) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 5.318162 0.571693 9.302 < 2e-16 *** Pclass -1.175648...
From: Stats Stack Exchange | By: SLOBY | Thursday, August 28, 2014
I have a data set with a range of 0 to 1. I'm fitting a distribution to it in MATLAB using fitdist(Data_set, 'exponential'). I want to find the third quartile with log(4)/λ. When I calculate this, the result is 4.28738. As I mentioned, my data set range is between...
From: Stats Stack Exchange | By: user2991243 | Thursday, August 28, 2014
I've been struggling with this problem for a couple of hours, and I could use some advice. I have a linear regression model with two continuous predictors and a categorical one (with 4 levels). I want to test for a linear trend in the categorical variable,...
From: Stats Stack Exchange | By: Meghan | Thursday, August 28, 2014
I have a plot like this. I wish to apply a model to this, however, I guess a linear regression model won't work on this. What I did was plot it on logarithm x and logarithm y axis as well but it came out to be of no use. With logarithm: I tried fitting...
From: Stats Stack Exchange | By: bjohn | Thursday, August 28, 2014
Let $X_1,...,X_n$ be iid with pdf $$f(x|\theta) = \exp(\theta - x)\, I_{(\theta, \infty)}(x)$$ It is asked to find an unbiased estimator for $\theta$ that is a function of a sufficient statistic for $\theta$. By the factorization theorem, we show that $X_{(1)}$...
From: Stats Stack Exchange | By: Giiovanna | Thursday, August 28, 2014
First, I need to prove that the distribution of a RV X, where X|lambda ~ Pois(lambda), and lambda ~ gamma(a, B), is a negative binomial. I know that it is, but why negative binomial instead of another 2 parameter distribution? How do I prove negative...
From: Stats Stack Exchange | By: areyoujokingme | Thursday, August 28, 2014
I have some problem understanding and using PCA (principal component analysis) So I have used Accord.net C# framework for calculating PCA. I also read this so I better understand PCA. So if I am understanding correctly we build PCA on some data as some...
From: Stats Stack Exchange | By: TheMentor | Thursday, August 28, 2014
I have a distribution like this: What is name of this distribution? As you know we have 68–95–99.7 rule in normal distribution. Can we have something like this in this distribution? Thanks....
From: Stats Stack Exchange | By: user2991243 | Thursday, August 28, 2014
I am facing two problems while using the caret package in R. I am reproducing an example below: library(mlbench) library(caret) set.seed(998) data(Sonar) #Random data, just for illustration purpose Sonar= Sonar[, 1:6] #Selected first 6 columns only for showing...
From: Stats Stack Exchange | By: learner | Thursday, August 28, 2014
I'm new to R and I am interested in strucchange package. I have two codes; code1 and code2. **###code1** data(Nile) library(strucchange) ## fit, visualize and test OLS-based CUSUM test ## with a Cramer-von Mises functional ocus <- efp(Nile ~ 1, type...
From: Stats Stack Exchange | By: precision | Thursday, August 28, 2014
If I take some sequences of random numbers generated by a random number generator with uniform distribution, will the resulting sequences be uniformly distributed as well? For example, if I have a generator that returns 1, 2, or 3, what are the probabilities...
From: Stats Stack Exchange | By: sammy | Thursday, August 28, 2014
I have the following test data: How do I correlate the "believes" with "before" to determine if it affects the "after"?
From: Stats Stack Exchange | By: andrepcg | Thursday, August 28, 2014
In oil and gas exploration/development it is common to use acoustic impedance derived from reflection seismic surveys to predict the porosity measured in wells drilled in the reservoir. I often use tables such as the one below (from a paper) to test for...
From: Stats Stack Exchange | By: MyCarta | Thursday, August 28, 2014
Let us suppose that I have a number of features. I design pdfs for every feature and every class, some of them by smoothing some histogram of training samples, others just by introducing a priori knowledge on how the feature should look like. Now I want...
From: Stats Stack Exchange | By: user116773 | Thursday, August 28, 2014
I am comparing 2 separate datasets with the Mann-Whitney U test. I am analyzing three separate parameters present in both datasets and want to know which parameter gives the greatest separation between the two datasets. The lower the P value, the greater...
From: Stats Stack Exchange | By: user3790338 | Wednesday, August 27, 2014
The widely known regression equation for assessing the three-way interaction is $$Y= B_1 X+B_2 Z+B_3 W +B_4XZ+B_5XW+B_6ZW+B_7XZW+B_0$$ All lower order terms are included in the regression equation for the $B_7$ coefficient to represent the effect of the...
From: Stats Stack Exchange | By: Funn Me | Thursday, August 28, 2014
I have three groups of data, each with a binomial distribution (i.e. each group has elements that are either success or failure). I do not have a predicted probability of success, but instead can only rely on the success rate of each as an approximation...
From: Stats Stack Exchange | By: Scott | Thursday, August 28, 2014
I carried out factor analysis on 31 five-point Likert-scale questions which represent my 6 constructs. The results showed 6 factors, consistent with my hypothesis, but the analysis grouped my questions differently from what I expected. For example when I wrote...
From: Stats Stack Exchange | By: Hasan | Thursday, August 28, 2014
I'm wondering what the effect is if I don't include moderators in my model? Is this the same or different from an omitted variable bias? I am having a hard time grasping this conceptually. More information: I did a GEE analysis for a binary outcome for...
From: Stats Stack Exchange | By: SB3 | Thursday, August 28, 2014
How do I find the variance for $z_n=\prod_{i=1}^n(1-k_i e^{a_i x})$ where $x$ is the random variable with a normal distribution and is the same for all $i$ (which is a subscript for time dependency which is not dependent with $x$). I have tried using...
From: Stats Stack Exchange | By: Sorin | Wednesday, August 27, 2014
I am doing factor analysis to check the factorial validity of a 14-item scale with four subscales. Two items have low (less than 0.3) correlations with other items in the subscale to which they belong. However, they load considerably (greater than 0.32)...
From: Stats Stack Exchange | By: Ayalew A. | Thursday, August 28, 2014
I'm using a neural network for a binary classification problem of bankruptcy prediction using the patternnet function in MATLAB, so I have the probability of bankruptcy for the out-of-sample data (final report). I'm searching for a method to combine probability of bankruptcy...
From: Stats Stack Exchange | By: user2991243 | Thursday, August 28, 2014
What is the best way to represent the following data graphically? Can I use a histogram? Year Output per Person Capital Employed 2010 16.3 units/annum £40,000 p.p 2011 15.1 units/annum £38,000 p.p 2012 14.4 units/annum £35,000 p.p 2013 11.7 units...
From: Stats Stack Exchange | By: chamzzey | Thursday, August 28, 2014
I would like to know if there is a difference in the values taken from measurements of a given characteristic sampled at primary forest=1 and secondary forest=0 I'm using two vectors, one with the data itself (DATA$Measurements), and the other one with...
From: Stats Stack Exchange | By: Mohr | Wednesday, August 27, 2014
I have searched for many days trying to find the answer to this question, and am still not 100% sure I am happy with my conclusion. I am interested in looking at the effects of environmental variables on detectability of marine mammals. I have Animal...
From: Stats Stack Exchange | By: Chandra | Thursday, August 28, 2014
After some searching, I find very little on the incorporation of observation weights/measurement errors into principal components analysis. What I do find tends to rely on iterative approaches to include weightings (e.g., here). My question is why is...
From: Stats Stack Exchange | By: noname | Wednesday, August 27, 2014
Can somebody point me in the right direction for a treatment of the following problem? I imagine this should be a fairly common problem in medical statistics... Given two binomial random variables $X_1\sim Bin(n_1,\pi_1)$ and $X_2\sim Bin(n_2,\pi_2)$...
From: Stats Stack Exchange | By: Max | Wednesday, August 27, 2014
I am using a Cox proportional hazards model to run a survival analysis in R on a number of non-nested, distinct covariates such as Age, Blood Type, Cancer, etc: A, B, C, D, E When I run the model on the omnibus null hypothesis: surv ~ A + B + C + D The...
From: Stats Stack Exchange | By: user50710 | Thursday, August 28, 2014
Suppose I observe binary $Y_{ij}$ for $i = 1, ..., N$ and $j = 1, ..., J$ and I want to model $$\Pr(Y_{ij} = 1 \mid \lambda_{i}) = \Phi(\lambda_{ij}), \qquad [Y_{ij} \perp Y_{ij'} \mid \lambda_i]$$ where the vector $\lambda_{i} = (\lambda_{i1}, \ldots,...
From: Stats Stack Exchange | By: guy | Thursday, August 28, 2014
Have 4 groups of results from 4 different treatments. Want to compare pre- and post-treatment within groups and between groups.
From: Stats Stack Exchange | By: Ray Yukna | Thursday, August 28, 2014
I am trying to calculate average wait time for one of the projects that I am working on. The way the project works is that there are pathways that students follow and each pathway has several steps. At each step students get a challenge, or challenge...
From: Stats Stack Exchange | By: nasia jaffri | Thursday, August 28, 2014
I am trying to detect automated visits to a website. A typical data set for an automated client is of the form: userid: visit_time1, visit_time2, ... 94562: 5, 10, 15, 25, 30 ^ missed So the period of visit times is 5. Occasionally (as I marked above),...
From: Stats Stack Exchange | By: Bryan Glazer | Wednesday, August 27, 2014
I can't find literature to understand this table. What I know from doing my research is that an effect size should be between 0 and 1, where 0.2 is low, 0.5 medium, and 0.8 and higher, high. But in an article in the Cochrane database I found these results and...
From: Stats Stack Exchange | By: user1338101 | Wednesday, August 27, 2014
I am trying to run a regression using about 80 independent variables. The problem is that the last 20+ coefficients return NA. If I condense the range of data to within 60, I get coefficients for everything just fine. Why am I getting these NAs and how...
From: Stats Stack Exchange | By: user2662565 | Wednesday, August 27, 2014
I have some code that looks for clusters in x,y data. To check the number of clusters I use, I want to get the BIC. This is not possible (easily) using kmeans(), and so I've switched to the mclust package. Specifically, I'm trying to replace kmeans()...
From: Stats Stack Exchange | By: Andy Clifton | Wednesday, August 27, 2014
Trying to create a list of significant variables for a regression model. I have 82 variables, so in order to only include the significant ones, I created a list of the correlations and sorted them. I want to include the variables with correlations >...
From: Stats Stack Exchange | By: user54804 | Wednesday, August 27, 2014
I'm trying to fit a model with the function glmer (lme4 1.1-7 package) in R using REML but I just get an error saying extra argument(s) ‘REML’ disregarded (see below). How else can I fit it with REML to optimize the random effects structure with...
From: Stats Stack Exchange | By: Ines | Wednesday, August 27, 2014
Breusch-Pagan rejects the H0 on these residuals: > length(model$residuals) [1] 515959 > summary(model$residuals) Min. 1st Qu. Median Mean 3rd Qu. Max. -205.000 -4.420 -0.451 0.000 4.130 196.000 > quantile(model$residuals, seq(0, 1, 1/10)) 0%...
From: Stats Stack Exchange | By: Robert Kubrick | Wednesday, August 27, 2014
I've been trying to locate and access circular/angular datasets for my research. I'm particularly interested in datasets that distribute according to a mean centered circular distribution with proper references for citation. Also, some insights in finding...
From: Stats Stack Exchange | By: user1019667 | Wednesday, August 27, 2014
I have a variety of samples, each with a different standard deviation and mean. The coefficient of variation $CV = \sigma / \mu$ defines the amount of variation in a population or sample around its mean. Is it meaningful to then use $1/CV$ or some...
From: Stats Stack Exchange | By: 114 | Wednesday, August 27, 2014
If we have residuals of an ARIMA(p,d,q) with known parameters how can we get the time series?
From: Stats Stack Exchange | By: Fred | Wednesday, August 27, 2014
I have 2 patient populations taken from the same time period that underwent 2 different surgical laparoscopic procedures. I want to compare the rate of conversion to an open surgical procedure. In my data this is represented as dummy code under the variable...
From: Stats Stack Exchange | By: oort | Wednesday, August 27, 2014
I'm trying to interpret some results here, and just want to make sure that my logic is sound. I'm predicting a binary outcome with a categorical predictor (gene level coded as 0, 1, or 2 depending on the number of risk alleles present). My hypothesis...
From: Stats Stack Exchange | By: Chris C | Wednesday, August 27, 2014
I read that OLS underestimates variance when residuals are autocorrelated. I see why autocorrelation would be a problem in time series analysis, in the sense that the coefficients are not efficient because we're not including all the potential predictors....
From: Stats Stack Exchange | By: Robert Kubrick | Wednesday, August 27, 2014
