R | SeanvdM

Univariate Time Series Overview

Why do we care? Understanding and predicting time series can help us make better decisions. Better decisions can lead to more profit, smaller losses, less waste of natural resources, and other benefits. How can we predict the unpredictable? By breaking the problem down into parts we know how to deal with. We know how to deal with independent and identically distributed values: draw a histogram and get summary statistics.
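As a minimal sketch of that last step, the snippet below draws a histogram and prints summary statistics for a sample treated as independent and identically distributed. The data are simulated placeholders (an assumption made only for illustration), not data from the post.

```r
set.seed(1)

# Placeholder data: 500 simulated values standing in for an i.i.d. sample
x <- rexp(500, rate = 0.5)

# Histogram of the values
hist(x, breaks = 30, main = "Histogram of x", xlab = "x")

# Summary statistics: five-number summary plus mean, then the standard deviation
stats <- summary(x)
print(stats)
print(sd(x))
```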

Useful R code for the Dirichlet distribution

Simulation: To simulate a Dirichlet sample quickly, we use the method given on the Wikipedia page for the Dirichlet distribution.

    rdirichlet <- function(N = 1, K = c(1, 1)) {
      # Simulations from the Dirichlet distribution, according to the method of Wikipedia
      lk <- length(K)
      sim <- matrix(0, N, lk)
      gams <- matrix(0, N, lk)
      # One column of independent gamma draws per concentration parameter
      for (i in 1:lk) {
        gams[, i] <- matrix(rgamma(N, K[i]), N, 1)
      }
      # Divide each row by its total so that every row sums to one
      gamtotal <- matrix(rowSums(gams), N, lk)
      sim <- gams / gamtotal
      return(sim)
    }

Frequentist parameter estimation: The simplest method of parameter estimation is the method of moments:
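The excerpt stops before giving a formula, so the following is only a sketch of one common moment-matching estimator, not necessarily the one the post uses. It relies on the Dirichlet identities E[X_i] = a_i / a_0 and Var[X_i] = E[X_i](1 - E[X_i]) / (a_0 + 1), where a_0 is the sum of the parameters. The function names `sim_dirichlet` and `mom_dirichlet` are hypothetical helpers introduced here for illustration.

```r
set.seed(1)

# Hypothetical helper: Dirichlet draws obtained by normalising gamma variates
sim_dirichlet <- function(n, alpha) {
  k <- length(alpha)
  # Column j holds n gamma draws with shape alpha[j]
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n)
  g / rowSums(g)
}

# Hypothetical helper: moment-matching estimate using the first component's variance
mom_dirichlet <- function(x) {
  m <- colMeans(x)        # estimates of a_i / a_0
  v1 <- var(x[, 1])       # sample variance of the first component
  a0 <- m[1] * (1 - m[1]) / v1 - 1
  m * a0
}

true_alpha <- c(2, 3, 5)
draws <- sim_dirichlet(20000, true_alpha)
est <- mom_dirichlet(draws)
print(round(est, 2))
```

Using only the first component to estimate a_0 is a design choice made for brevity; averaging the estimate over all components is another reasonable option.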

Multiple lines on a graph example

What are we doing and why? We are going to draw a graph in R with multiple lines on one set of axes. This is worth explaining because the way base R draws graphs can seem strange to people who are used to other packages.

Specific example: In this example we are going to use the lengths of the 25 most popular movies of each year from 1931 to 2013, as explained here by Randy Olson.
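The usual base R pattern can be sketched as follows: `plot()` sets up the axes and draws the first line, and `lines()` adds further lines to the existing plot. The two series below are made-up placeholders over the same year range, not the movie-length data, and the names `series_a` and `series_b` are assumptions for illustration.

```r
set.seed(42)

years <- 1931:2013

# Made-up series purely for illustration (not the movie data)
series_a <- 90 + 0.2 * (years - 1931) + rnorm(length(years), sd = 3)
series_b <- 100 + 0.1 * (years - 1931) + rnorm(length(years), sd = 3)

# Compute a common y-range first, because plot() fixes the axes
# using only the data it is given
y_range <- range(series_a, series_b)

# First line: creates the plot and the axes
plot(years, series_a, type = "l", col = "blue", ylim = y_range,
     xlab = "Year", ylab = "Value")

# Second line: drawn on top of the existing plot
lines(years, series_b, col = "red", lty = 2)

legend("topleft", legend = c("Series A", "Series B"),
       col = c("blue", "red"), lty = c(1, 2))
```

Setting `ylim` up front matters because a later `lines()` call does not rescale the axes, so a line that falls outside the original range is simply clipped.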

Basic predictive intervals via simulation

What are we doing and why? We are going to look forward. We want to predict the next thing. Companies want to know what’s going to happen with the next client, the next month’s profits, the next deal, etc. We are thus very much in the data science paradigm. However, a lot of data scientists are narrow in their thinking: they just want to predict the values as closely as they can.
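A minimal sketch of a simulation-based predictive interval is given below, under assumptions that are not from the post: the observations are treated as roughly normal and independent, and parameter uncertainty is approximated by bootstrap resampling. The data are simulated placeholders.

```r
set.seed(123)

# Placeholder data standing in for past observations
observed <- rnorm(60, mean = 100, sd = 15)

n_sims <- 5000

# For each simulation: resample the data, re-estimate mean and sd,
# then draw one possible "next" value from the fitted normal
next_value <- replicate(n_sims, {
  resample <- sample(observed, replace = TRUE)
  rnorm(1, mean = mean(resample), sd = sd(resample))
})

# 95% predictive interval taken from the simulated next values
pred_interval <- quantile(next_value, probs = c(0.025, 0.975))
print(pred_interval)
```

The interval describes where the next value may plausibly fall, which is a wider and more useful statement than a single point prediction.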

Simulating an ARIMA Time Series

Step 1: How long must the time series be? Begin by deciding on a reasonable length for your time series, based on the problem at hand. Then add a burn-in period: an extra stretch that we add at the start of the time series while simulating but throw away later. When you simulate a time series, the first part you simulate will not follow your chosen model and must be discarded.
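The burn-in idea can be sketched with a hand-rolled ARMA(1,1) simulation. The coefficients, the length, and the burn-in size below are arbitrary assumptions chosen for illustration, not values from the post.

```r
set.seed(2024)

n <- 200          # length we actually want (assumed)
burn_in <- 100    # extra values simulated at the start, discarded later (assumed)
phi <- 0.7        # AR coefficient (assumed)
theta <- 0.4      # MA coefficient (assumed)

total <- n + burn_in
e <- rnorm(total)        # white-noise innovations
y <- numeric(total)
y[1] <- e[1]             # arbitrary starting value; this is why burn-in is needed

# ARMA(1,1) recursion: y[t] = phi * y[t-1] + e[t] + theta * e[t-1]
for (t in 2:total) {
  y[t] <- phi * y[t - 1] + e[t] + theta * e[t - 1]
}

# Throw away the burn-in period and keep the last n values
series <- y[(burn_in + 1):total]

# For one order of differencing (d = 1), integrate the stationary series
integrated <- cumsum(series)
```

Base R's `arima.sim()` performs a similar burn-in internally, with its length controlled by the `n.start` argument.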

Variance Of Sample Kurtosis

Kurtosis: First we define the sample kurtosis function, as well as a matrix version. Source: Wikipedia

    kurtosis1 <- function(x) {
      xstand <- x - mean(x)
      kurt <- (mean((xstand)^4)) / ((mean((xstand)^2))^2)
      return(kurt)
    }
    kurtosis <- function(X) {
      apply(X, 2, kurtosis1)
    }

Generate samples: We select the range of sample sizes and generate a matrix of samples arranged in columns, for each size.

    minsize <- 100
    maxsize <- 10000
    stepsize <- 50
    samplesizes <- seq(minsize, maxsize, stepsize)
    nsizes <- length(samplesizes)
    nsample.persize <- 500
    samples <- vector('list', nsizes)
    for (i in 1:nsizes) {
      samples[[i]] <- matrix(rnorm(samplesizes[i]*nsample.
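A smaller, self-contained sketch of the same kind of experiment is shown below. It estimates the variance of the sample kurtosis for normal samples at a few sample sizes and compares it with the large-sample approximation 24/n. The sample sizes, the number of replications, and the helper name `kurt` are assumptions for illustration and differ from the post's settings.

```r
set.seed(7)

# Sample kurtosis: fourth central moment over squared second central moment
kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

sizes <- c(100, 400, 1600)   # assumed sample sizes
n_rep <- 2000                # assumed number of replications per size

# For each size, simulate n_rep normal samples and take the variance
# of their sample kurtosis values
sim_var <- sapply(sizes, function(n) {
  var(replicate(n_rep, kurt(rnorm(n))))
})

result <- data.frame(n = sizes,
                     simulated = sim_var,
                     asymptotic = 24 / sizes)
print(result)
```

The agreement with 24/n is expected to improve as the sample size grows, since the approximation is asymptotic.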