In this report, we present the process of building a model that distinguishes, from the signals of four sensors, whether a weight-lifting motion is performed according to the exercise specification. We used the data provided by the coursework, which comes from http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises. We build a random forest model whose out-of-sample error is expected to be low.
The provided dataset consists of sensor readings, a classe variable that records whether the exercise was performed correctly (and, if not, in which way it was incorrect), the subject id, and so on. The Practical Machine Learning course provides the dataset in two parts, pml-training.csv and pml-testing.csv. The pml-testing.csv data set has no classe variable, since we are asked to predict it. Therefore we use only the pml-training.csv data both to build the model and to test it.
We first clean the given data, since many columns are mostly blank or unrelated to the activity class. The first seven columns describe the data-gathering environment (row index, user name, timestamps, and window markers), so we can remove them safely. We also remove every column that contains an NA in its first row.
library(caret)
# Treat blanks and Excel-style division errors as missing values
pml <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
# Drop the first seven bookkeeping columns
pml <- subset(pml, select = -c(1:7))
# Keep only the columns whose first row is not NA
pml <- pml[, apply(pml[1, ], 2, function(x) !is.na(x))]
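To see what the first-row filter does, here is a minimal sketch on a made-up miniature of the sensor table (the column names here are invented for illustration; in the real data the summary columns, e.g. the kurtosis_* ones, are NA on ordinary measurement rows):

```r
# Made-up miniature: "kurtosis_roll" mimics a summary column that is NA
# on ordinary measurement rows
toy <- data.frame(roll_belt     = c(1.2, 0.8),
                  kurtosis_roll = c(NA, 3.1),
                  classe        = c("A", "B"))

# Same filter as above: keep only columns whose first row is not NA
toy <- toy[, apply(toy[1, ], 2, function(x) !is.na(x))]
names(toy)   # the NA-led summary column is gone
```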
We split the pml-training data into a training set and a test set for model building and testing, using 3/4 for training and 1/4 for testing.
set.seed(123)
inTrain <- createDataPartition(pml$classe, p=0.75, list=FALSE)
training <- pml[inTrain,]
testing <- pml[-inTrain,]
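createDataPartition performs a stratified split, sampling within each level of classe so that the class proportions are preserved in both subsets. A base-R sketch of the same idea, on made-up class counts, looks like this:

```r
set.seed(123)
# Made-up outcome with unbalanced classes: 40 A's, 40 B's, 20 C's
classe <- factor(rep(c("A", "B", "C"), times = c(40, 40, 20)))

# Sample 75% of the row indices within each class separately
inTrain <- unlist(lapply(split(seq_along(classe), classe),
                         function(idx) sample(idx, floor(0.75 * length(idx)))))

table(classe[inTrain])   # 30 A's, 30 B's, 15 C's: proportions preserved
```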
We set up 3-fold cross-validation as the training control, and build a random forest model from the training set above.
set.seed(123)
tc <- trainControl(method = "cv", number = 3)  # "repeats" applies only to method = "repeatedcv"
rfFit3 <- train(classe ~ ., data=training, trControl = tc, method="rf")
After the training step, we can inspect the accuracy of the fitted model as follows.
rfFit3$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.9890609 0.9861611 0.0018505420 0.0023403443
## 2 27 0.9891290 0.9862482 0.0004222084 0.0005337559
## 3 52 0.9817908 0.9769637 0.0020942570 0.0026502308
The in-sample error rate is (1 - accuracy). In general, the in-sample error rate is smaller than the out-of-sample error rate. Therefore, we expect the out-of-sample error to be somewhat above 1 - 0.9891290 = 0.010871.
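For concreteness, the in-sample error estimate comes straight from the best cross-validated accuracy (the mtry = 27 row of rfFit3$results shown above):

```r
cv_accuracy     <- 0.9891290      # best Accuracy row (mtry = 27) above
in_sample_error <- 1 - cv_accuracy
round(in_sample_error, 6)         # 0.010871
```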
We test the random forest model on the test set prepared earlier.
rfPred3 <- predict(rfFit3, newdata = testing)
cm <- confusionMatrix(rfPred3, testing$classe)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 5 0 0 0
## B 1 942 8 0 0
## C 0 2 842 8 0
## D 0 0 5 794 1
## E 0 0 0 2 900
##
## Overall Statistics
##
## Accuracy : 0.9935
## 95% CI : (0.9908, 0.9955)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9917
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9926 0.9848 0.9876 0.9989
## Specificity 0.9986 0.9977 0.9975 0.9985 0.9995
## Pos Pred Value 0.9964 0.9905 0.9883 0.9925 0.9978
## Neg Pred Value 0.9997 0.9982 0.9968 0.9976 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1921 0.1717 0.1619 0.1835
## Detection Prevalence 0.2853 0.1939 0.1737 0.1631 0.1839
## Balanced Accuracy 0.9989 0.9952 0.9912 0.9930 0.9992
Surprisingly, the out-of-sample error rate on the held-out test set is 1 - accuracy = 0.0065253, which is even lower than the in-sample error rate estimated from cross-validation.
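The overall accuracy reported by confusionMatrix is simply the trace of the table divided by its total. Rebuilding the table above in base R reproduces the same error rate:

```r
# Held-out confusion matrix copied from the output above
# (matrix() fills column by column, i.e. one Reference class at a time)
cm_tab <- matrix(c(1394,   1,   0,   0,   0,    # Reference A
                      5, 942,   2,   0,   0,    # Reference B
                      0,   8, 842,   5,   0,    # Reference C
                      0,   0,   8, 794,   2,    # Reference D
                      0,   0,   0,   1, 900),   # Reference E
                 nrow = 5,
                 dimnames = list(Prediction = LETTERS[1:5],
                                 Reference  = LETTERS[1:5]))

accuracy <- sum(diag(cm_tab)) / sum(cm_tab)   # correct predictions / total
round(1 - accuracy, 7)                        # 0.0065253
```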
Now we can answer the 20 quiz questions as follows. To protect the coursework system, we do not reveal the answers here. All of our answers were graded correct, which means the error rate on the quiz set is zero!
problems <- read.csv("pml-testing.csv")
answers <- predict(rfFit3, newdata = problems)
We have presented a model-building process and obtained a random forest model with very high accuracy.
To speed up the machine learning procedure, you may enable R's parallel processing features as follows.
library(doParallel); library(caret)
registerDoParallel(cores = detectCores())
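An alternative sketch using only the base parallel package makes the worker lifetime explicit; the cluster object could then be handed to registerDoParallel(cl) before training, and stopCluster releases the workers when training is done:

```r
library(parallel)

cl <- makeCluster(max(1, detectCores() - 1))    # leave one core for the OS
# ... registerDoParallel(cl) and run train() here ...
squares <- parSapply(cl, 1:4, function(x) x^2)  # trivial check that workers run
stopCluster(cl)                                 # always release the workers
squares
```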
From the fitted random forest, we can list the 20 most important features (predictors) as follows.
varImp(rfFit3)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 60.98
## yaw_belt 57.98
## magnet_dumbbell_z 45.64
## pitch_belt 45.15
## magnet_dumbbell_y 44.08
## roll_forearm 43.41
## accel_dumbbell_y 21.63
## roll_dumbbell 19.55
## magnet_dumbbell_x 18.08
## accel_forearm_x 16.85
## magnet_belt_z 15.95
## accel_dumbbell_z 14.61
## accel_belt_z 14.36
## total_accel_dumbbell 14.32
## magnet_forearm_z 13.26
## magnet_belt_y 13.16
## gyros_belt_z 12.06
## yaw_arm 10.73
## magnet_belt_x 10.15
A plot of the first two features against the outcome classe variable suggests that a simpler model may be possible.
This report is the result of an assignment in the Practical Machine Learning course on Coursera.