In this report, we describe the process of building a model that distinguishes whether a weight lifting motion is performed according to the exercise specification, using the signals from four body-worn sensors. We used the data provided by the coursework, which comes from http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises. We build a random forest model whose out-of-sample error is expected to be low.

The dataset provided consists of sensor readings, a variable indicating whether the exercise was performed correctly or, if not, how it was incorrect (classe), row and subject identifiers, and so on. The Practical Machine Learning course split the dataset into two parts, pml-training.csv and pml-testing.csv. The pml-testing.csv set has no classe variable, since that is what we are asked to predict. Therefore, we use only the pml-training.csv data for both model building and model testing.

Data preparation and cleansing

We first clean up the given data, since many columns are mostly blank or unrelated to the activity classe. The first seven columns describe the data-gathering environment (record id, subject, timestamps, and window markers) rather than the motion itself, so we can remove them safely. We also remove every column that has an NA in its first row; in this dataset those columns are sparse summary statistics.

library(caret)
# Read the data, treating blanks and Excel division errors as NA.
pml <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
# Drop the first seven bookkeeping columns (id, subject, timestamps, windows).
pml <- subset(pml, select = -c(1:7))
# Keep only columns with no NA in the first row (fully populated signals).
pml <- pml[, apply(pml[1, ], 2, function(x) !is.na(x))]
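As a quick sanity check (the exact counts assume the standard pml-training.csv from the course), the cleaned data frame should keep 52 fully populated predictors plus the classe outcome:

dim(pml)         # expect 19622 rows and 53 columns (52 predictors + classe)
sum(is.na(pml))  # expect 0: no missing values remain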

Training and testing sets

We split the pml-training data into a training set and a test set, so that we can build the model on one part and test it on data it has not seen. We use 3/4 of the rows for training and 1/4 for testing.

set.seed(123)
inTrain <- createDataPartition(pml$classe, p=0.75, list=FALSE)
training <- pml[inTrain,]
testing <- pml[-inTrain,]
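We can verify the split sizes; the testing count matches the confusion-matrix total reported later, and the training count assumes the standard 19622-row input file:

dim(training)  # 14718 rows for model building
dim(testing)   # 4904 rows held out for estimating out-of-sample error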

Cross-validation setting and training

We use 3-fold cross-validation for the training control, and we build a random forest model from the training set above.

set.seed(123)
tc <- trainControl(method = "cv", number = 3)  # 3-fold cross-validation
rfFit3 <- train(classe ~ ., data = training, trControl = tc, method = "rf")
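Fitting a random forest on roughly 15,000 rows and 52 predictors can take several minutes; see the appendix for a way to parallelize the training.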

In-sample error rate

After training, caret reports the cross-validated accuracy for each candidate value of the mtry tuning parameter and keeps the best one, here mtry = 27:

rfFit3$results
##   mtry  Accuracy     Kappa   AccuracySD      KappaSD
## 1    2 0.9890609 0.9861611 0.0018505420 0.0023403443
## 2   27 0.9891290 0.9862482 0.0004222084 0.0005337559
## 3   52 0.9817908 0.9769637 0.0020942570 0.0026502308

The in-sample error rate is (1 - accuracy). In general, the in-sample error rate is smaller than the out-of-sample error. Therefore, we expect the out-of-sample error to be above 0.010871.
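That estimate can be pulled straight from the tuning results:

1 - max(rfFit3$results$Accuracy)  # 0.010871, for the selected mtry = 27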

Testing the model and out-of-sample error

We test the random forest model on the testing set prepared earlier.

rfPred3 <- predict(rfFit3, newdata = testing)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
cm <- confusionMatrix(rfPred3, testing$classe)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    5    0    0    0
##          B    1  942    8    0    0
##          C    0    2  842    8    0
##          D    0    0    5  794    1
##          E    0    0    0    2  900
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9935          
##                  95% CI : (0.9908, 0.9955)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9917          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9926   0.9848   0.9876   0.9989
## Specificity            0.9986   0.9977   0.9975   0.9985   0.9995
## Pos Pred Value         0.9964   0.9905   0.9883   0.9925   0.9978
## Neg Pred Value         0.9997   0.9982   0.9968   0.9976   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1921   0.1717   0.1619   0.1835
## Detection Prevalence   0.2853   0.1939   0.1737   0.1631   0.1839
## Balanced Accuracy      0.9989   0.9952   0.9912   0.9930   0.9992

Surprisingly, the out-of-sample error rate is 0.0065253, which is even lower than the in-sample estimate above.
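The figure is simply one minus the overall accuracy from the confusion matrix:

unname(1 - cm$overall["Accuracy"])  # 0.0065253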

Answering the questions

Now we can answer the questions by predicting the classe of the 20 problems in pml-testing.csv. To protect the coursework system, we do not reveal the predicted values here. When submitted, all of our answers were graded correct, so the error rate on the 20 problems is zero.

problems <- read.csv("pml-testing.csv")
answers <- predict(rfFit3, newdata = problems)
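For submission, each prediction can be written to its own text file. Below is a minimal sketch of such a helper; the problem_id_<i>.txt naming convention is an assumption about what the course grader expects.

# Hypothetical submission helper: one file per predicted classe.
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(answers))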

Conclusion

We have presented a model building process and obtained a random forest model with very high accuracy on held-out data.

Appendix

Learning in parallel

To speed up the machine learning procedures, you may enable the parallel features of R as follows; caret's train will then run its resampling iterations in parallel via foreach.

library(doParallel); library(caret)
# Register a parallel backend using all available cores.
registerDoParallel(cores = detectCores())
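When training is finished, you can return to sequential execution with registerDoSEQ() from the foreach package, which doParallel loads.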

Important features

After the random forest training, we can list the 20 most important features (predictors) as follows.

varImp(rfFit3)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt             100.00
## pitch_forearm          60.98
## yaw_belt               57.98
## magnet_dumbbell_z      45.64
## pitch_belt             45.15
## magnet_dumbbell_y      44.08
## roll_forearm           43.41
## accel_dumbbell_y       21.63
## roll_dumbbell          19.55
## magnet_dumbbell_x      18.08
## accel_forearm_x        16.85
## magnet_belt_z          15.95
## accel_dumbbell_z       14.61
## accel_belt_z           14.36
## total_accel_dumbbell   14.32
## magnet_forearm_z       13.26
## magnet_belt_y          13.16
## gyros_belt_z           12.06
## yaw_arm                10.73
## magnet_belt_x          10.15

A plot of the two most important features, colored by the outcome classe, suggests that there may be a way to build a much simpler model.
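The figure itself is not embedded in this report; a plotting sketch (assuming ggplot2) would be:

library(ggplot2)
# Scatter the two top-ranked predictors, colored by activity classe.
ggplot(training, aes(x = roll_belt, y = pitch_forearm, colour = classe)) +
  geom_point(alpha = 0.3)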


  1. This report is the result of an assignment in the Practical Machine Learning course on Coursera.