I'm working on my first independent project in pattern classification. I'm using some datasets from the UCI Machine Learning Repository, but I'm not sure how to start with data normalization. The data isn't that large (feature vectors of around 15-20 dimensions),...

Assuming a dataset with the following attributes: Date (truncated), f1 ... fn, #impressions, #goals. The problem: I want to grow $n$ trees that would find the optimal selection of features and their ranges in each, and that maximize the goal rate (goals...

I need to correlate employee engagement (measured with the 9-item UWES questionnaire) and organizational commitment (measured with the 18-item Organizational Commitment Scale). Both of them can be divided into different components; UWES...

I am looking to predict groups of items that someone will purchase, i.e., I have multiple, collinear dependent variables. Rather than building 7 or so independent models to predict the probability of someone buying each of the 7 items, and then combining...

I am working on a project, and I am totally new to statistics. I have weekly sales data for the last two years, along with other variables such as temperature and holiday (TRUE/FALSE), where holiday is a nominal variable. I have to do forecasting for the...

I am working to analyze poverty rate using census data. I have a huge dataset. I want to extract the likelihood from this dataset in order to create patterns for energy consumption. Let's say this: in a house where we have 3 members with average age...

I am trying to do experiments on classifying longitudinal systems. We're working on classifying the location where we sell items most. I don't have a lot of experience in statistics and modeling data beyond a high school statistics course so I'm kinda...

$X$ and $Y$ are uniformly distributed on the unit disk. Thus,
$f_{X,Y}(x,y) = \begin{cases} \frac{1}{\pi}, & \text{if} ~ x^2+y^2 \leq 1,\\
0, &\text{otherwise.}\end{cases}$
If $Z=X+Y$, find the pdf of $Z$.
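
One way to sanity-check a candidate answer is by simulation. Because the uniform distribution on the disk is rotation invariant, $(X+Y)/\sqrt{2}$ has the same marginal as $X$, namely $\frac{2}{\pi}\sqrt{1-x^2}$, which suggests $f_Z(z)=\frac{1}{\pi}\sqrt{2-z^2}$ for $|z|\leq\sqrt{2}$. The sketch below compares that candidate with a Monte Carlo estimate; the function names, bin width, sample size and seed are illustrative assumptions, not part of the question.

```python
import math
import random


def sample_unit_disk(rng):
    """Draw one (x, y) uniformly from the unit disk by rejection sampling."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y


def candidate_pdf(z):
    """Candidate density of Z = X + Y: sqrt(2 - z^2) / pi on [-sqrt(2), sqrt(2)]."""
    return math.sqrt(2.0 - z * z) / math.pi if z * z <= 2.0 else 0.0


def empirical_density(z0, width=0.1, n=100_000, seed=0):
    """Estimate the density of Z near z0 as (share of draws in the bin) / width."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = sample_unit_disk(rng)
        if abs(x + y - z0) <= width / 2:
            hits += 1
    return hits / (n * width)


if __name__ == "__main__":
    # Print simulated and candidate densities side by side at a few points.
    for z0 in (0.0, 0.5, 1.0):
        print(z0, round(empirical_density(z0), 3), round(candidate_pdf(z0), 3))
```

The two columns should agree up to simulation noise; a clear mismatch would indicate an error in the candidate density.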

I have a question about the prediction of volatility and returns of a time series. Basically it is a question about prediction in the fGarch package. The following code is from the book Analysis of Financial Time Series and it is an example of AR/GARCH...

I was doing some self study and came across the following formulae for estimating standard errors: Formula 1: Formula 2: I understand that both can be used when the population standard deviation is unknown. But I don't really understand why...

I am looking for an introductory to intermediate level book on Generalized Linear Models. Ideally, in addition to the theory behind the models, I would want it to include applications and examples in R or another programming language - I hear SAS is...

Can you please provide:
- One advantage of "k-Means" compared to "Hierarchical Clustering"
- One advantage of "Hierarchical Clustering" compared to "k-Means"
Thanks in advance!

I am here to seek opinions on how I should represent the data that I have collected. I am to create a presentation focusing on the environment. I was told that a simple bar chart and a line graph are bad visualizations. I have picked the following data set...

I was wondering whether you could help me with this question. I am not sure whether I am doing it correctly, so any guidance would be most appreciated. I will post the full question, so please bear with me. Let X be a random variable...

I am working on a project where I have to do multi-label text classification. I want to understand whether my approach is correct or whether I am missing something. I am using R to do it. Clean the text. Create a corpus. While creating the corpus I am removing...

I have $n$ dice with $m$ sides. The $i^{th}$ die will show value $0 \leq x_i \leq m-1$ with probability $0 \leq D_i(x_i) \leq 1$. What is the probability that the sum of the dice equals $\alpha$? Is there some approximation for $P(\alpha)$...
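
For independent dice, the exact distribution of the sum is the convolution of the individual distributions, which can be built up one die at a time. The following is a minimal sketch under that independence assumption; `sum_pmf` and the list-of-lists input layout are illustrative choices, with `dice[i][x]` standing in for $D_i(x)$.

```python
def sum_pmf(dice):
    """Return the pmf of the sum of independent dice.

    `dice` is a list of pmfs; dice[i][x] plays the role of D_i(x) for x = 0..m-1.
    The result is a list whose entry at index s is P(sum = s).
    """
    pmf = [1.0]  # the empty sum equals 0 with probability 1
    for die in dice:
        new = [0.0] * (len(pmf) + len(die) - 1)
        for s, p_s in enumerate(pmf):
            for x, p_x in enumerate(die):
                new[s + x] += p_s * p_x  # convolve the running pmf with this die
        pmf = new
    return pmf


if __name__ == "__main__":
    # Two fair dice with faces 0..5; look up P(sum = 5).
    fair = [1.0 / 6.0] * 6
    print(sum_pmf([fair, fair])[5])
```

This costs on the order of $n^2 m^2$ operations. For large $n$, a normal approximation with matching mean and variance is a common cheaper alternative, though its accuracy depends on the $D_i$.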

I am using LibSVM (3.18) as an implementation of SVM, but every time I predict, the result is zero. I am following these instructions: I have a CSV file (50K+ lines). Most of the data in the target column is zeros; the other values are between...

I am looking for a method or package in R that can remove heteroscedasticity from time series. Specifically, I have a number of time series $$Z = (Z_1, \ldots, Z_p)$$ where $Z_j = \{(Z_j)_t\}_{t=1}^{T}$, to which I want to fit a VAR model. Each time series...

I have collected data from 88 human subjects. There are two subject groups, A (test) and B (control). The number of subjects in each group is 44. The subjects are paired between groups. There are two measurements from each subject, one before, and one after...

I have a predictor with responses from 140 people in group A and 60 in group B. My mediator only uses responses from group A, and my outcome variable uses responses from Groups A, B, C $(n=31)$. What type of analysis do I need to run? What software would...

I collected some data on a species of goose called Brent Goose over the winter. A csv file of the data can be downloaded from Dropbox or imported straight into R with this code: library(repmis) goose_behaviour <- repmis::source_DropboxData("goose_behaviour.csv",...

According to my understanding, when we have an unknown population mean and variance, we have to estimate the population variance through the sample variance and use the t distribution to estimate the potential range of the population mean using the estimated population variance...

Following Hofert et al.'s paper "Likelihood inference for Archimedean copulas in high dimensions under known margins" (http://dl.acm.org/citation.cfm?id=2263953), I wrote a script in Matlab to produce estimates of Archimedean copulas in high dimensions....

I have the following hourly time series data and would like to fit a best-fit line to it: There seems to be periodicity on both a daily and a weekly basis. By this, I mean there are patterns that repeat every day (e.g. peaks at 7 PM) and patterns...

What is the difference between a compositional data model using the additive log-ratio (alr) transformation and an aggregated multinomial logit model?

I have a question about the Arellano-Bond model in Stata (xtabond/xtabond2). Are the slopes I get for levels or for differences of the values? The model to be estimated has the form (D is the first difference): DY=a+DX1+DX2+.... So should I use already differenced...

I've been reading the Wikipedia page for Levene's test, and it cites the degrees of freedom as (k - 1, N - k), where k is the number of different groups to which the sampled cases belong, and N is the total number of cases in all groups. However, it...

What is the difference between finite and infinite variance? My stats knowledge is rather basic; Wikipedia and Google weren't much help here.

I'm working on a review paper and need to collect the means and standard deviations of a given measure (such as a measure of depression) from papers of interest. However, some authors report means and standard deviations for each item on the measure,...

I read an article that says the dependent variables in a regression model must be normally distributed. The way I understand it, the observations for the regression model must then be normally distributed. Or in other words, if I choose sample...

From this video by Andrew Ng, around 5:00: How are $\delta_3$ and $\delta_2$ derived? In fact, what does $\delta_3$ even mean? $\delta_4$ is obtained by comparing to $y$; no such comparison is possible for the output of a hidden layer, right?...

When comparing feature-based classification techniques what characteristics about the different processes should be considered? I'm comparing different classification techniques to try to figure out what should be considered when selecting a classification...

Let $\mathcal{H}\colon\mathbf{w}\cdot\mathbf{x}+b=0$ be a separating hyperplane produced by some binary linear classifier. Let $\mathbf{x}_t$ be a new, unseen sample that needs to be classified. We can predict the true label of $\mathbf{x}_t$...

I hope I am asking this in a way that makes sense. I am comparing 8 means and want to set up planned comparisons, rather than having my Bonferroni adjustment become overly conservative in a post hoc test. For my groups I need to make a total of 16 comparisons,...

I estimated the mean and variance of two latent variables through two groups of data. I can't use the data to do hypothesis testing, because I am interested in the latent variable. Is there a way to test whether the two latent variables are significantly...

For the following problem: $\text{min:}\ f(x)\\ s.t. \ g(x)\leq t$ Is the above problem equivalent to the following problem? $\text{min:}\ f(x) + \lambda g(x) \\ s.t. \ \lambda\geq0$ where $t$ and $\lambda$ are variables. It seems equivalent, because...

Assume a model like this, basically explaining stock market returns with a bunch of stuff: stockReturn(t) ~ bondReturn(t) + moneyMarketReturn(t) + inflation(t) + somethingElse(t) Does using inflation as an independent variable bring any significant problems?...

In the questionnaire I asked respondents from two countries how many job offers they received from 5 sources in the last 6 months. There are 5 questions - one for each source. It is an open question, without a scale as the two countries strongly differ...

I ran the same SEM model in sem and lavaan. I got the same parameters and - generally - very close test values, with the exception of AIC and BIC which were immensely different between the two packages. The following is the resulting AIC and BIC from...

Suppose I have a big online company, and many of my customers churned (i.e. they were paying, and then stopped). My goal is to understand why each of them churned. First I identify the complete set of reasons for churning, $H_1,\ldots,H_n$. E.g. "the...

I have a variable whose value I can only measure at the end of life of a product (which is not fixed). The variable's value, continuous and between 0 and 100, may be related to its age at that time. My data consists of the various ages of a set of products...

I am talking about a situation in which I have several continuous predictor variables predicting a continuous outcome. One of the predictors has a very non-normal distribution and has some wild outliers. I intend to generalize the regression model to...

Good evening all, I am doing a self-study exercise, but have been puzzled by a part of the question on finding percentage points of a normal distribution. I fully understand the first part of the question and was able to find the answer, which corresponds...

How do you interpret the results of a multivariate probit regression? Is it interpreted the same way as OLS?

I have a set of data with features of movies and features of users, and a third matrix with each user's rating for each movie. I have to build a recommendation system for new users. Can you help me with the problem? I am not sure how to go about it. What...

I am looking for a Python library or module function that allows me to estimate probability densities p(x) using the Parzen-window approach with a Gaussian kernel (with variable sigma, or 'window width'). I managed to implement the Parzen technique using...

I'm a PhD student in New Zealand. I need to determine the impact of lameness on the milk yield of cows. I measured milk yield daily and recorded which cows were observed lame on any given day. I recorded data daily for 325 consecutive days. It...

Please forgive this silly question, I'm fairly new to statistics. Consider this R code: a = c(1,2,3,4,3,2,3,4,5,5,6,5,4,3,4,5,6,7,8,7,6,6,5,6,7,10,9) b = c(10,9,7,6,5,6,7,8,4,6,6,5,4,5,6,5,4,5,6,7,5,4,4,5,4,3,2) mean((a - mean(a))*(b-mean(b))) [1] -2.42524...
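
For reference, the R expression above averages the products of deviations, so it divides by $n$ (the population form). The sample covariance, which is what R's `cov()` reports, divides by $n-1$ instead. A minimal Python sketch of both versions on the same data follows; the helper `covariance` and its `ddof` argument are illustrative names, not taken from the question.

```python
def covariance(xs, ys, ddof=0):
    """Covariance of two equal-length sequences, dividing by n - ddof."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)


a = [1, 2, 3, 4, 3, 2, 3, 4, 5, 5, 6, 5, 4, 3, 4, 5, 6, 7, 8, 7, 6, 6, 5, 6, 7, 10, 9]
b = [10, 9, 7, 6, 5, 6, 7, 8, 4, 6, 6, 5, 4, 5, 6, 5, 4, 5, 6, 7, 5, 4, 4, 5, 4, 3, 2]

pop_cov = covariance(a, b)             # divides by n, like the mean(...) expression
sample_cov = covariance(a, b, ddof=1)  # divides by n - 1

if __name__ == "__main__":
    print(pop_cov, sample_cov)
```

The two values differ only by the factor $n/(n-1)$, here $27/26$.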

The negative binomial distribution can be parameterized using the mean, $\mu$, and the overdispersion, $\psi$, so that the variance of the NB is $\mu + \frac{\mu^2}{\psi}$. We know there is no analytical solution for estimating $\psi$. I understand the variance of NB...
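
The lack of a closed form presumably refers to the maximum-likelihood estimate of $\psi$. A method-of-moments estimate, by contrast, follows directly from solving $\text{Var} = \mu + \mu^2/\psi$ for $\psi$, which gives $\psi = \mu^2/(\text{Var}-\mu)$. The sketch below illustrates that rearrangement only; `nb_moment_estimates` is a hypothetical helper, and the estimate is defined only when the sample variance exceeds the sample mean.

```python
from statistics import mean, variance


def nb_moment_estimates(counts):
    """Method-of-moments estimates (mu, psi) for Var = mu + mu^2 / psi."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 divisor)
    if var <= mu:
        raise ValueError("no overdispersion: sample variance <= sample mean")
    return mu, mu * mu / (var - mu)


if __name__ == "__main__":
    # Toy counts with mean 4 and sample variance 10.
    print(nb_moment_estimates([0, 2, 4, 6, 8]))
```

Such a moment estimate is sometimes used as a starting value for an iterative likelihood fit.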

My problem is that I want to implement Parzen-window estimation with a Gaussian kernel, but I have trouble understanding how I can check whether a point (2D or 3D) lies within a Gaussian sphere. Given a set of sample points, I want to check how many...
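
With a hypercube kernel, a sample is either inside the window or not, so the estimate is a count. With a Gaussian kernel there is no hard inside/outside test: every sample contributes a weight that decays smoothly with its distance from the query point. A minimal sketch of that estimator for any dimension follows, assuming an isotropic kernel with a single width `h`; the function name and argument layout are illustrative.

```python
import math


def parzen_gaussian(x, samples, h):
    """Parzen-window density estimate at point x with an isotropic Gaussian kernel.

    x: sequence of d coordinates; samples: list of such sequences;
    h: window width (the kernel's standard deviation).
    """
    d = len(x)
    norm = (2.0 * math.pi) ** (d / 2.0) * h ** d  # Gaussian normalizing constant
    total = 0.0
    for s in samples:
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += math.exp(-sq_dist / (2.0 * h * h))  # soft weight instead of a 0/1 count
    return total / (len(samples) * norm)


if __name__ == "__main__":
    pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(parzen_gaussian([0.5, 0.5], pts, h=0.5))
```

Smaller values of `h` make the estimate spikier around the samples, and larger values smooth it out.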

