Motive/Background
The following is my submission for the Practical Machine Learning course project in Johns Hopkins University's 10-course Data Science Specialization. From the assignment prompt:
“One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways (…) The goal of your project is to predict the manner in which they did the exercise. This is the ‘classe’ variable in the training set. You may use any of the other variables to predict with.”
Data
The data used for this project was sourced from the Human Activity Recognition dataset (Ugulino et al., 2012): http://groupware.les.inf.puc-rio.br/har.
Writeup
Click here to view the full RPubs writeup/report for this project with all of the included code. The rest of this document will summarize/restate what is already on that page.
Loading & Reading Data
To begin, let's load some of the packages that we will be using:
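The full code lives in the RPubs report linked above; a minimal sketch of this step might look like the following (the exact package list is an assumption):

```r
library(caret)         # model training, data partitioning, confusionMatrix
library(randomForest)  # backend for caret's "rf" method
library(corrplot)      # correlation heat map used later (assumed package)
```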
Now, let's create a directory and download our dataset into it:
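A sketch of the download step, assuming the standard Coursera URLs for this assignment and a local `data/` directory:

```r
# Create a data directory if one doesn't already exist
if (!dir.exists("data")) dir.create("data")

# Standard Coursera URLs for this assignment (assumed)
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(train_url, destfile = "data/pml-training.csv")
download.file(test_url,  destfile = "data/pml-testing.csv")
```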
Now that our data files are located in our directory, we can read them into objects named `train` and `test`, representing our training set and test set, respectively:
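Something like this, where the `na.strings` values are an assumption based on how this dataset is commonly formatted:

```r
# Treat empty strings and "#DIV/0!" entries as NA while reading
train <- read.csv("data/pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
test  <- read.csv("data/pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))
```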
Data Cleaning/Processing
Previewing the Data
Let's run a few functions to get a quick sense of how our data looks:
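For example:

```r
dim(train)  # 19622 rows, 160 columns
dim(test)   # 20 rows, 160 columns
```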
As you can see, both of our sets contain 160 columns/features, with the training set containing 19622 rows and the test set only containing 20 rows.
Taking a look at the first few rows of our training set (only showing the first 16 columns):
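In code, that preview is simply:

```r
head(train[, 1:16])
```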
Right off the bat, you can see some descriptive variables that won't be necessary for our model. Additionally, several columns contain NA values.
Data Cleaning
Since the first 7 columns are irrelevant to our model (we are only concerned with the gyroscope/accelerometer readings), we will simply remove them:
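A sketch of the removal, using plain column indexing:

```r
# Drop the first 7 bookkeeping columns (identifiers, timestamps, and
# window indicators) from both sets
train <- train[, -(1:7)]
test  <- test[, -(1:7)]
```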
We will now check for features that have near zero variance. Since variables with very low variability do not contribute meaningfully to our prediction model, we can forgo these features and remove them from our datasets:
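This can be done with caret's `nearZeroVar` function; roughly:

```r
# Find near-zero-variance features on the training set, then drop the same
# columns from both sets so they stay aligned
nzv <- nearZeroVar(train)
if (length(nzv) > 0) {
  train <- train[, -nzv]
  test  <- test[, -nzv]
}
```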
Since many columns consist largely of NA values, we will simply remove every feature that contains missing values, which the random forest algorithm we use later requires anyway:
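One way to express this, computing the complete columns on the training set and applying the same selection to both sets:

```r
# Keep only the columns that contain no NA values
complete_cols <- colSums(is.na(train)) == 0
train <- train[, complete_cols]
test  <- test[, complete_cols]
```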
Let's see how many columns we are left with:
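```r
dim(train)  # 19622 rows, 53 columns
```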
We went from 160 features down to only 53! There was clearly a lot of noise and data unnecessary for our purpose, though that is better than the alternative (not having enough data).
Creating Validation Set
We will now also partition our main training set into a smaller training set and a validation set. We will use this validation set later to test our model's accuracy before finally predicting `classe` values on the test set.
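A sketch using caret's `createDataPartition`; the 70/30 split ratio and the seed value are assumptions, since they are not restated here:

```r
train$classe <- factor(train$classe)  # caret expects a factor outcome

set.seed(1234)  # arbitrary seed for reproducibility (assumed value)
in_train   <- createDataPartition(train$classe, p = 0.7, list = FALSE)
validation <- train[-in_train, ]
train      <- train[in_train, ]
```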
Visualizing Feature Correlation
Let's make a heat map to visualize the correlation between our variables:
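One way to draw it, assuming the corrplot package for the visualization:

```r
# Correlation matrix over the predictors (everything except classe)
cor_mat <- cor(train[, names(train) != "classe"])
corrplot(cor_mat, method = "color", type = "lower",
         tl.cex = 0.5, tl.col = "black")
```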
As you can see, there is definitely some correlation among the variables that will be fed into our model. For instance, the acceleration of the belt in the z direction is correlated with the total acceleration of the belt, the "roll" of the belt, and the movement of the belt in other directions. Likewise, the movement of the forearm is correlated with the movement of the dumbbell, as would be expected.
Creating our Model
Finally, let's create our model. I chose the random forest algorithm mainly due to its high accuracy when modeling non-linear relationships. Additionally, I used 5-fold cross-validation by passing the `method = "cv", 5` arguments to the `trainControl` function:
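In outline, with everything besides the cross-validation setup left at caret's defaults:

```r
ctrl <- trainControl(method = "cv", 5)  # 5-fold cross-validation
rf_model <- train(classe ~ ., data = train, method = "rf", trControl = ctrl)
```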
Testing Prediction Accuracy on Validation Set
Let's now test our model on our validation set to see how accurate the predictions are:
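Roughly:

```r
val_pred <- predict(rf_model, validation)
confusionMatrix(val_pred, validation$classe)
```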
As you can see, we have a prediction accuracy of 1 (100%): every single prediction on the validation set was correct. In other words, our expected out-of-sample error is 0%. This could mean that our model is over-fitted; however, considering that the objective of this project is to predict values drawn from a held-out subset of the same data we trained on, I would say our main goal of model accuracy is certainly achieved in this context.
Test Set Final Predictions
Finally, predicting the 20 `classe` values of the test set:
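In code:

```r
test_pred <- predict(rf_model, test)
test_pred  # the 20 predicted classe values (not shown here)
```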
The 20 test-set predictions are part of the graded quiz portion of this assignment on Coursera, and thus are not shown here. However, you'll just have to take my word that every prediction is indeed correct!