In the following report, we use data gathered via fitness trackers to build a predictive model. The model’s goal is to correctly identify the type of weight lifting movement based on the fitness tracker data. We perform a brief exploratory analysis of the data, decide on which machine learning algorithm to apply in building our model, and then train a model and apply it to the given testing set.
First, we load all packages needed for our analysis. Note that lattice is loaded because it is a dependency of the caret package; for plotting, however, we will use ggplot2. Next, we read in the data from the respective URLs in the code chunk below.
library(ggplot2)
library(data.table)
library(lattice) # Required for 'caret' package
library(caret)
training <- fread('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv')
testing <- fread('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv')
Let’s get to know the data and its structure.
head(training[,1:20])
## V1 user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1: 1 carlitos 1323084231 788290 05/12/2011 11:23
## 2: 2 carlitos 1323084231 808298 05/12/2011 11:23
## 3: 3 carlitos 1323084231 820366 05/12/2011 11:23
## 4: 4 carlitos 1323084232 120339 05/12/2011 11:23
## 5: 5 carlitos 1323084232 196328 05/12/2011 11:23
## 6: 6 carlitos 1323084232 304277 05/12/2011 11:23
## new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1: no 11 1.41 8.07 -94.4 3
## 2: no 11 1.41 8.07 -94.4 3
## 3: no 11 1.42 8.07 -94.4 3
## 4: no 12 1.48 8.05 -94.4 3
## 5: no 12 1.48 8.07 -94.4 3
## 6: no 12 1.45 8.06 -94.4 3
## kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 1: NA NA NA NA
## 2: NA NA NA NA
## 3: NA NA NA NA
## 4: NA NA NA NA
## 5: NA NA NA NA
## 6: NA NA NA NA
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## 1: NA NA NA NA
## 2: NA NA NA NA
## 3: NA NA NA NA
## 4: NA NA NA NA
## 5: NA NA NA NA
## 6: NA NA NA NA
## max_yaw_belt
## 1: NA
## 2: NA
## 3: NA
## 4: NA
## 5: NA
## 6: NA
unique(training$user_name)
## [1] "carlitos" "pedro" "adelmo" "charles" "eurico" "jeremy"
unique(training$classe)
## [1] "A" "B" "C" "D" "E"
training[, user_name := factor(user_name)][, classe := factor(classe)]
There are six subjects and five exercise classes. The metadata (found at https://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har#weight_lifting_exercises) indicates that class A is the exercise performed with correct form, while the other classes are the same exercise performed with various common form mistakes. We converted both user_name and classe to factors. Additionally, we notice that a number of variables seem to have many NA values. Let's investigate those variables further.
percent.na <- data.table()
for (n in names(training)) {
  na <- sum(is.na(training[, get(n)]))
  percent <- (na / nrow(training)) * 100
  percent.na <- rbind(percent.na, data.table('var' = c(n), 'percent' = c(percent)))
}
head(percent.na[percent>0])
## var percent
## 1: kurtosis_roll_belt 97.98186
## 2: kurtosis_picth_belt 98.09398
## 3: kurtosis_yaw_belt 100.00000
## 4: skewness_roll_belt 97.97676
## 5: skewness_roll_belt.1 98.09398
## 6: skewness_yaw_belt 100.00000
percent.na[percent > 0, min(percent)]
## [1] 97.93089
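As an aside, the same per-column NA percentages can be computed without an explicit loop. A minimal base R sketch on a toy data frame (the columns a, b, and c below are hypothetical, not from the fitness-tracker data):

```r
# Toy data frame standing in for 'training' (hypothetical values)
dt <- data.frame(a = c(1, NA, 3), b = c(NA, NA, NA), c = 1:3)

# Percent NA per column: mean(is.na(col)) is the fraction of missing values
pct <- sapply(dt, function(col) mean(is.na(col)) * 100)
pct  # a = 33.33, b = 100, c = 0
```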
The above code tells us that, of all the variables containing NA values, even the variable with the fewest is still 97.9% missing. Columns that sparse are not worth imputing, so we will exclude them from our analysis. Note that these appear to be derived summary variables, such as kurtosis, skewness, average, and variance. Removing them from our training set leaves us with the raw positional and acceleration data.
calcvars <- percent.na[percent > 0, var]
rawvars <- names(training)[!(names(training) %in% calcvars)]
training <- training[,..rawvars]
Next, we’ll plot a few of our variables to see if there are any obvious connections.
ggplot(training, aes(total_accel_belt, accel_dumbbell_z, color = classe)) + geom_point(alpha = 0.3)
ggplot(training, aes(total_accel_belt, accel_dumbbell_y, color = classe)) + geom_point(alpha = 0.3)
ggplot(training, aes(total_accel_arm, accel_dumbbell_z, color = classe)) + geom_point(alpha = 0.3)
The plots above suggest that no single pair of variables cleanly separates the exercise classes, so there is no obvious simple predictor for our model. We can infer from this that more sophisticated algorithms may be needed.
Our first task is to divide the training set into several training and validation sets. We will use k-fold cross-validation to assess how well different algorithms predict exercise class, choosing k = 5 to balance the bias-variance trade-off.
trainingraw <- training[, 7:60]
folds <- createFolds(trainingraw$classe, k=5)
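For intuition, the stratified folds that createFolds produces can be sketched in base R: within each class, shuffle the row indices and deal them round-robin into k groups. The toy labels below are hypothetical, not the real classe column:

```r
set.seed(1)
y <- rep(c("A", "B"), each = 10)            # toy class labels (hypothetical)
k <- 5
folds.toy <- vector("list", k)
for (cls in unique(y)) {
  idx <- sample(which(y == cls))            # shuffle this class's row indices
  grp <- rep_len(seq_len(k), length(idx))   # deal them round-robin into k folds
  for (f in seq_len(k)) folds.toy[[f]] <- c(folds.toy[[f]], idx[grp == f])
}
lengths(folds.toy)  # 4 rows per fold: 2 from each class, so class balance is preserved
```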
Now that our training set has been partitioned into five folds, let's test a couple of models: one using boosting and one using linear discriminant analysis. The following code chunks iterate through the folds, train a model on the data with the current fold held out, and then test that model on the held-out fold. Each model's accuracy is recorded and the model is then discarded.
results.gbm <- data.table()
for (f in 1:5){
  # Hold out fold f for validation; train on the remaining four folds
  validate <- trainingraw[folds[[f]]]
  validate[, classe := factor(classe, levels = c('A', 'B', 'C', 'D', 'E'))]
  train <- trainingraw[-folds[[f]]]
  mdl <- train(classe ~ ., data = train, method = 'gbm', verbose = FALSE)
  pred <- predict(mdl, validate)
  res <- confusionMatrix(pred, validate$classe)
  results.gbm <- rbind(results.gbm, res$overall[1])  # keep only overall accuracy
}
results.lda <- data.table()
for (f in 1:5){
  validate <- trainingraw[folds[[f]]]
  validate[, classe := factor(classe, levels = c('A', 'B', 'C', 'D', 'E'))]
  train <- trainingraw[-folds[[f]]]
  mdl <- train(classe ~ ., data = train, method = 'lda', verbose = FALSE)
  pred <- predict(mdl, validate)
  res <- confusionMatrix(pred, validate$classe)
  results.lda <- rbind(results.lda, res$overall[1])
}
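For reference, the accuracy extracted via res$overall[1] is simply the fraction of predictions that match the truth, which can be seen with a base R toy example (the vectors below are hypothetical, not model output):

```r
# Toy prediction/truth vectors (hypothetical)
pred  <- factor(c("A", "B", "A", "C"), levels = c("A", "B", "C"))
truth <- factor(c("A", "B", "B", "C"), levels = c("A", "B", "C"))
acc <- mean(pred == truth)  # 3 of 4 correct
acc  # 0.75
```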
Now we compare each algorithm’s accuracy.
results.gbm[,mean(x)]; results.lda[,mean(x)]
## [1] 0.9869533
## [1] 0.7127714
ggplot(data.table(results.gbm, 1:5), aes(V2, x)) +
  geom_point(color = 'steelblue', alpha = 0.5) +
  geom_line(color = 'steelblue', alpha = 0.5, size = 1) +
  geom_point(aes(1:5, results.lda[, x]), color = 'coral', alpha = 0.5) +
  geom_line(aes(1:5, results.lda[, x]), color = 'coral', alpha = 0.5, size = 1) +
  labs(y = 'accuracy', x = 'fold')
Clearly, the boosting algorithm (blue) is much more accurate at predicting classe. Thus, for our final model selection, we will use only the boosting algorithm. That model is trained with the code chunk below.
finalmdl <- train(classe~., data = trainingraw, method = "gbm", verbose = FALSE)
With our algorithm chosen and our model trained, we can now apply it to the testing set. Recall that the fold accuracies above were measured on held-out data, so the corresponding error rate of roughly 1% is itself an estimate of the out-of-sample error rate. Because the k-fold iterations showed little variation between folds (shown below), we expect the error rate on the testing set to be comparable, and at most a few points higher.
results.gbm
## x
## 1: 0.9905732
## 2: 0.9813965
## 3: 0.9862420
## 4: 0.9859837
## 5: 0.9905708
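Plugging the five fold accuracies printed above into a quick base R calculation quantifies both the average cross-validation error and its spread:

```r
# Fold accuracies copied from results.gbm above
acc <- c(0.9905732, 0.9813965, 0.9862420, 0.9859837, 0.9905708)
mean(1 - acc) * 100  # mean cross-validation error rate: about 1.3%
sd(acc) * 100        # spread across folds: about 0.38 percentage points
```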
In the code below, we use our model to predict classe for the testing set. Since we do not have access to the true classe values for the testing set, we simply view the predictions.
test.pred <- predict(finalmdl, testing)
test.pred
## [1] B A B A A E D D A A B C B A E E A B B B
## Levels: A B C D E