Summary

We want to answer the question of whether participants are performing barbell lifts correctly. Our data comes from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Data loading

We fetch the train and the test data. More information on the data is available here.

Exploration of the data revealed that many numeric variables have values “NA” and “#DIV/0!” in some rows, so we convert these to numeric NA values on load.

library(dplyr)
library(caret)
library(gbm)

if(!file.exists("data")) {
  dir.create("data")
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "data/train.csv")
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "data/test.csv")
}
trainFile <- read.csv("data/train.csv", na.strings = c("#DIV/0!", "NA"))
testFile <- read.csv("data/test.csv")
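
As a quick sanity check (illustrative; kurtosis_roll_belt is one of the affected columns, visible in the str() output later), the messy columns should now be numeric with NAs:

# Check that "#DIV/0!" strings were converted: the column should be numeric
class(trainFile$kurtosis_roll_belt)
# Count how many rows were converted to NA
sum(is.na(trainFile$kurtosis_roll_belt))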

Study design

First we must consider the study design (the split of data between training, validation and testing sets).

rbind(trainFile=dim(trainFile), testFile=dim(testFile))
##            [,1] [,2]
## trainFile 19622  160
## testFile     20  160

There are 160 variables (including the outcome classe). We see that there are 19622 observations in our train file, and only 20 in our test file. We set aside the test file data for now (it is only intended for the final automated test part of the assignment). Since we have a large sample size, a rule of thumb for splitting our data would be 60% train, 20% validation, 20% test. We split the trainFile data into 80% training data (which the caret train() function will further split into training and validation folds during cross-validation) and 20% testing data (which we will use to report the out of sample error).

We create our train set:

set.seed(1234)
inTrain <- createDataPartition(trainFile$classe, p=.8, list=FALSE)
train <- trainFile[inTrain,]
test <- trainFile[-inTrain,]
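
As a quick check, the row counts of the two partitions should reflect the 80/20 split:

# Illustrative check: sizes of the training and held-out test partitions
c(train = nrow(train), test = nrow(test))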

Data exploration

Plotting predictors

We do exploration only on the training set to avoid making variable selection decisions on the test set, which could cause overfitting.

We see the data can be divided into raw accelerometer data and aggregated derived features based on windows of data. We suspect it may be easier to separate the classes using these aggregated derived features.

Usually we would set aside the test file and only use segmented training data to build and refine our model. However, the available test data shows us which variables/features will be available when we apply our model. If we inspect the test file, we see that all the aggregated features are NA.

str(testFile, list.len=20)
## 'data.frame':    20 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 6 5 5 1 4 5 5 5 2 3 ...
##  $ raw_timestamp_part_1    : int  1323095002 1322673067 1322673075 1322832789 1322489635 1322673149 1322673128 1322673076 1323084240 1322837822 ...
##  $ raw_timestamp_part_2    : int  868349 778725 342967 560311 814776 510661 766645 54671 916313 384285 ...
##  $ cvtd_timestamp          : Factor w/ 11 levels "02/12/2011 13:33",..: 5 10 10 1 6 11 11 10 3 2 ...
##  $ new_window              : Factor w/ 1 level "no": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  74 431 439 194 235 504 485 440 323 664 ...
##  $ roll_belt               : num  123 1.02 0.87 125 1.35 -5.92 1.2 0.43 0.93 114 ...
##  $ pitch_belt              : num  27 4.87 1.82 -41.6 3.33 1.59 4.44 4.15 6.72 22.4 ...
##  $ yaw_belt                : num  -4.75 -88.9 -88.5 162 -88.6 -87.7 -87.3 -88.5 -93.7 -13.1 ...
##  $ total_accel_belt        : int  20 4 5 17 3 4 4 4 4 18 ...
##  $ kurtosis_roll_belt      : logi  NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : logi  NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : logi  NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ max_roll_belt           : logi  NA NA NA NA NA NA ...
##  $ max_picth_belt          : logi  NA NA NA NA NA NA ...
##  $ max_yaw_belt            : logi  NA NA NA NA NA NA ...
##   [list output truncated]

Furthermore, the test data points are not continuous in time. This means that we cannot derive the aggregated features from our test data points. Therefore we cannot use aggregated features at all in our model training. By inspecting the data we see the aggregated features are those starting with ‘total_’, ‘kurtosis_’, ‘skewness_’, ‘max_’, ‘min_’, ‘amplitude_’, ‘var_’, ‘avg_’ and ‘stddev_’.
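
We can verify this programmatically. A minimal check (illustrative) lists the columns that are entirely NA in the test file:

# Columns with no usable values in the test data
allNA <- colnames(testFile)[colSums(is.na(testFile)) == nrow(testFile)]
length(allNA)
head(allNA)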

There are more columns that we won’t be using as features in our classifier:

  • X: the row index
  • user_name: the name of one of the 6 participants. We want our classifier to detect the outcome class independent of the person performing the exercise
  • raw_timestamp_part_1, raw_timestamp_part_2 and cvtd_timestamp: timestamps of the recording, which we also drop in the code below
  • new_window and num_window: these define the windows used to create the derived features
  • classe: the dependent, outcome variable we wish to predict (retained in the data frame, but not used as a feature)

train_reduced <- train %>%
  select(-c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
            cvtd_timestamp, num_window, new_window),
         -starts_with("total_"), -starts_with("kurtosis_"),
         -starts_with("skewness_"), -starts_with("max_"),
         -starts_with("min_"), -starts_with("amplitude_"),
         -starts_with("var_"), -starts_with("avg_"), -starts_with("stddev_"))

colnames(train_reduced)
##  [1] "roll_belt"         "pitch_belt"        "yaw_belt"         
##  [4] "gyros_belt_x"      "gyros_belt_y"      "gyros_belt_z"     
##  [7] "accel_belt_x"      "accel_belt_y"      "accel_belt_z"     
## [10] "magnet_belt_x"     "magnet_belt_y"     "magnet_belt_z"    
## [13] "roll_arm"          "pitch_arm"         "yaw_arm"          
## [16] "gyros_arm_x"       "gyros_arm_y"       "gyros_arm_z"      
## [19] "accel_arm_x"       "accel_arm_y"       "accel_arm_z"      
## [22] "magnet_arm_x"      "magnet_arm_y"      "magnet_arm_z"     
## [25] "roll_dumbbell"     "pitch_dumbbell"    "yaw_dumbbell"     
## [28] "gyros_dumbbell_x"  "gyros_dumbbell_y"  "gyros_dumbbell_z" 
## [31] "accel_dumbbell_x"  "accel_dumbbell_y"  "accel_dumbbell_z" 
## [34] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [37] "roll_forearm"      "pitch_forearm"     "yaw_forearm"      
## [40] "gyros_forearm_x"   "gyros_forearm_y"   "gyros_forearm_z"  
## [43] "accel_forearm_x"   "accel_forearm_y"   "accel_forearm_z"  
## [46] "magnet_forearm_x"  "magnet_forearm_y"  "magnet_forearm_z" 
## [49] "classe"

Next we explore the relationship between these variables and the outcome via plots. See the appendix for the plots.

Looking at these, we see that there is no single variable that cleanly separates the five classes. Furthermore, the “gyros_” variables seem to have one or two extreme outliers.

summary(train_reduced$gyros_dumbbell_y)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.10000 -0.14000  0.03000  0.04688  0.21000 52.00000

We also note that not all variables are on the same scale.
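
For example, comparing the ranges of a few predictors (column names taken from the list above) illustrates the differing scales; a minimal sketch:

# Illustrative: predictors live on very different scales
sapply(train_reduced[, c("roll_belt", "gyros_belt_x", "magnet_dumbbell_y")], range)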

Training a model

We choose gradient boosted trees as the model. This type of model is appropriate since it:

  • can achieve excellent accuracy on data mining tasks
  • is not very sensitive to outliers in the data
  • can give variable importance scores that aid in interpretability
  • is insensitive to monotone transformations of inputs (which means we do not need to scale all variables to the same range)

The trainControl configuration will instruct the caret train function to use 5-fold cross-validation, repeated 5 times, for measuring the training accuracy.

fitControl <- trainControl(method = "repeatedcv", number=5, repeats = 5)
set.seed(123)

if(!file.exists("gbmFit1.rds")) {
  gbmFit1 <- train(classe ~., data=train_reduced, method="gbm", trControl = fitControl)
  saveRDS(gbmFit1, "gbmFit1.rds")
} else {
  gbmFit1 <- readRDS("gbmFit1.rds")
}
gbmFit1
## Stochastic Gradient Boosting 
## 
## 15699 samples
##    48 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 12558, 12558, 12560, 12559, 12561, 12559, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7521998  0.6858080
##   1                  100      0.8220769  0.7748878
##   1                  150      0.8534927  0.8146800
##   2                   50      0.8577996  0.8198443
##   2                  100      0.9090127  0.8848640
##   2                  150      0.9332308  0.9155114
##   3                   50      0.8995470  0.8728354
##   3                  100      0.9443653  0.9296031
##   3                  150      0.9628761  0.9530323
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

The best cross-validation accuracy achieved here is 96%. The final model has interaction.depth = 3 (the maximum depth of each of the trees in the 150-tree model), which indicates that the classes are best separated by considering interactions between multiple variables. This is consistent with our exploratory plotting, where we saw that no single variable was able to clearly distinguish between the classes.

ggplot(gbmFit1)

We see that for this set of tuning parameter values, a maximum tree depth of 3 and 150 trees gave the best results. Next we adjust the tuning parameter grid search to see if we can find a better model. Based on the first training run, we are interested to see if performance keeps increasing with deeper trees and with more trees, so we extend the grid of tuning parameters in both those directions.

gbmGrid <-  expand.grid(interaction.depth = c(3, 6, 9), 
                        n.trees = c(100, 200, 300, 400),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

nrow(gbmGrid)
## [1] 12

So this grid search over the parameter space will try 12 different combinations of hyperparameters.

if(!file.exists("gbmFit2.rds")) {
  gbmFit2 <- train(classe ~., data=train_reduced, method="gbm", trControl = fitControl, tuneGrid = gbmGrid)
  saveRDS(gbmFit2, "gbmFit2.rds")
} else {
  gbmFit2 <- readRDS("gbmFit2.rds")
}
gbmFit2
## Stochastic Gradient Boosting 
## 
## 15699 samples
##    48 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 12559, 12558, 12560, 12561, 12558, 12559, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   3                  100      0.9436777  0.9287301
##   3                  200      0.9724695  0.9651715
##   3                  300      0.9826997  0.9781156
##   3                  400      0.9871968  0.9838049
##   6                  100      0.9765082  0.9702799
##   6                  200      0.9890695  0.9861740
##   6                  300      0.9920379  0.9899288
##   6                  400      0.9931207  0.9912984
##   9                  100      0.9865343  0.9829672
##   9                  200      0.9925092  0.9905250
##   9                  300      0.9938723  0.9922494
##   9                  400      0.9946876  0.9932806
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 400,
##  interaction.depth = 9, shrinkage = 0.1 and n.minobsinnode = 10.
ggplot(gbmFit2)

Again the most complex model (maximum tree depth and number of trees) performed the best. However, all the accuracies above 0.99 are quite similar. Since increased model complexity carries a greater danger of overfitting, we could decide to choose a simpler model, for instance the depth-6 one with 200 trees.
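
If we wanted to adopt that simpler candidate, its resampled performance can be read directly from the tuning results table that caret stores on the fit object; a minimal sketch:

# Illustrative: resampled accuracy of the simpler depth-6, 200-tree candidate
subset(gbmFit2$results, interaction.depth == 6 & n.trees == 200,
       select = c(interaction.depth, n.trees, Accuracy, Kappa))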

Variable importance

Another useful property of boosted trees is the variable importance estimate, which makes the model a bit more interpretable. From the 48 variables, we see the relative importance of the top 10 visualised below.

plot(varImp(gbmFit2), top=10)
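
The same ranking can also be read programmatically; a minimal sketch (for gbm, varImp reports an Overall score per variable):

# Illustrative: extract the names of the two highest-ranked variables
imp <- varImp(gbmFit2)$importance
head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 2)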

Out of interest, let us visually inspect how well the classes are separable using the two most important variables.

qplot(x=roll_belt, y=yaw_belt, data=train_reduced, color=classe)

Even when visualising with the two most important variables, we observe a large amount of overlap between classes. This confirms the importance of combining more than two variables in the final model (as seen in models with higher interaction depth performing better).

Expected out of sample error

To estimate the out of sample error, we use our held out test data.

confusionMatrix(data = predict(gbmFit2, newdata=test), reference = test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1115    2    0    0    0
##          B    1  756    3    0    0
##          C    0    1  679   10    2
##          D    0    0    2  632    1
##          E    0    0    0    1  718
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9941          
##                  95% CI : (0.9912, 0.9963)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9926          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9960   0.9927   0.9829   0.9958
## Specificity            0.9993   0.9987   0.9960   0.9991   0.9997
## Pos Pred Value         0.9982   0.9947   0.9812   0.9953   0.9986
## Neg Pred Value         0.9996   0.9991   0.9985   0.9967   0.9991
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1927   0.1731   0.1611   0.1830
## Detection Prevalence   0.2847   0.1937   0.1764   0.1619   0.1833
## Balanced Accuracy      0.9992   0.9974   0.9943   0.9910   0.9978

We observe very good performance on the held-out test dataset. An accuracy of 99% is statistically significantly better than the no information rate of 28% (i.e. what you would achieve by always predicting the most frequent class). We expect a 99% out of sample accuracy when any of the same 6 users perform the same actions again.

However, we should not assume that this means the model will generalise with such high accuracy to new users performing the exercises. Recall that there are only 6 participants, and both our training and test sets used data from all of them. If we want a better idea of how a model trained on one set of users generalises to a new user, we would need a different train/test split, holding out data by participant.
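
Such a subject-wise split would look something like the following sketch (the held-out participant is an arbitrary illustrative choice):

# Sketch only (not used in this analysis): hold out one participant entirely
# to estimate generalisation to a user the model has never seen
heldOutUser <- "adelmo"  # arbitrary illustrative choice
trainBySubject <- subset(trainFile, user_name != heldOutUser)
testBySubject  <- subset(trainFile, user_name == heldOutUser)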

Appendix

Exploration of class balance

We want to predict the “classe” variable (the manner in which participants did the exercise). First we look for an imbalance in outcome classes. Looking at the spread of the training data observations we see that there are more observations in class A (the correct class) than in classes B-E:

summary(train$classe)
##    A    B    C    D    E 
## 4464 3038 2738 2573 2886

Training data that is somewhat unbalanced in number of observations between the outcome classes might bias our classifier. We will keep this in mind when interpreting our results.

We could try to compensate for the unbalanced training data by randomly discarding observations from the classes having more observations than the smallest class, D (2573).

trainBalanced <- train %>% group_by(classe) %>% sample_n(2573)
summary(trainBalanced$classe)
##    A    B    C    D    E 
## 2573 2573 2573 2573 2573

This naive approach to balancing throws away potentially useful training data. Given the good results we achieved on the training set with cross-validation, we did not need to pursue balancing.

Plots of independent variables vs outcome

We investigated all the variables, but for brevity only include 5 examples here.

library(ggplot2)
to_plot <- setdiff(colnames(train_reduced), "classe")
for (i in to_plot[1:5]) {
  print(ggplot(train_reduced, aes_string(x="classe", y=i, colour="classe")) + geom_boxplot())
}