Motive/Background
The following is my submission for the Practical Machine Learning course project in Johns Hopkins University's 10-course Data Science Specialization. From the assignment prompt:
“One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways (…) The goal of your project is to predict the manner in which they did the exercise. This is the ‘classe’ variable in the training set. You may use any of the other variables to predict with.”
Data
The data used for this project was sourced from the Human Activity Recognition dataset (Ugulino et al., 2012): http://groupware.les.inf.puc-rio.br/har.
Writeup
Click here to view the full RPubs writeup/report for this project with all of the included code. The rest of this document will summarize/restate what is already on that page.
Loading & Reading Data
To begin, let's load some of the packages that we will be using:
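The full code lives in the RPubs report linked above; a minimal sketch of this step might look like the following (the exact package list is an assumption):

```r
library(caret)         # model training, data partitioning, confusionMatrix
library(randomForest)  # backend for caret's "rf" method
library(corrplot)      # correlation heat map used later (assumed package)
```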
Now, let's create a directory and download our dataset into it:
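A sketch of the download step, assuming the standard Coursera URLs for this assignment and a local `data/` directory:

```r
# Create a data directory if one doesn't already exist
if (!dir.exists("data")) dir.create("data")

# Standard Coursera URLs for this assignment (assumed)
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(train_url, destfile = "data/pml-training.csv")
download.file(test_url,  destfile = "data/pml-testing.csv")
```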
Now that our data files are located in our directory, we can read them into objects named `train` and `test`, representing our training set and test set, respectively:
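Something like this, where the `na.strings` values are an assumption based on how this dataset is commonly formatted:

```r
# Treat empty strings and "#DIV/0!" entries as NA while reading
train <- read.csv("data/pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
test  <- read.csv("data/pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))
```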
Data Cleaning/Processing
Previewing the Data
Let's run a few functions to get a quick sense of how our data looks:
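For example:

```r
dim(train)  # 19622 rows, 160 columns
dim(test)   # 20 rows, 160 columns
```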
As you can see, both of our sets contain 160 columns/features, with the training set containing 19622 rows and the test set only containing 20 rows.
Taking a look at the first few rows of our training set (only showing the first 16 columns):
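In code, that preview is simply:

```r
head(train[, 1:16])
```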
Right off the bat, you can see some descriptive variables that won't be necessary for our model. Additionally, several columns contain NA values.
Data Cleaning
Since the first 7 columns are irrelevant to our model (we are only concerned with the gyroscope/accelerometer readings), we will simply remove them:
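A sketch of the removal, using plain column indexing:

```r
# Drop the first 7 bookkeeping columns (identifiers, timestamps, and
# window indicators) from both sets
train <- train[, -(1:7)]
test  <- test[, -(1:7)]
```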
We will now check for features that have near zero variance. Since variables with very low variability do not contribute meaningfully to our prediction model, we can forgo these features and remove them from our datasets:
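This can be done with caret's `nearZeroVar` function; roughly:

```r
# Find near-zero-variance features on the training set, then drop the same
# columns from both sets so they stay aligned
nzv <- nearZeroVar(train)
if (length(nzv) > 0) {
  train <- train[, -nzv]
  test  <- test[, -nzv]
}
```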
Since many columns consist largely of NA values, we will simply remove every feature that contains missing values, which the random forest algorithm we use later requires anyway:
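One way to express this, computing the complete columns on the training set and applying the same selection to both sets:

```r
# Keep only the columns that contain no NA values
complete_cols <- colSums(is.na(train)) == 0
train <- train[, complete_cols]
test  <- test[, complete_cols]
```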
Let's see how many columns we are left with:
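```r
dim(train)  # 19622 rows, 53 columns
```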
We went from 160 features down to only 53! There was clearly a lot of noise and data unnecessary for our purpose, though that is better than the alternative (not having enough data).
Creating Validation Set
We will now also partition our main training set into a smaller training set and a validation set. We will use this validation set later to test our model's accuracy before finally predicting `classe` values on the test set.
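A sketch using caret's `createDataPartition`; the 70/30 split ratio and the seed value are assumptions, since they are not restated here:

```r
train$classe <- factor(train$classe)  # caret expects a factor outcome

set.seed(1234)  # arbitrary seed for reproducibility (assumed value)
in_train   <- createDataPartition(train$classe, p = 0.7, list = FALSE)
validation <- train[-in_train, ]
train      <- train[in_train, ]
```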
Visualizing Feature Correlation
Let's make a heat map to visualize the correlation between our variables:
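One way to draw it, assuming the corrplot package for the visualization:

```r
# Correlation matrix over the predictors (everything except classe)
cor_mat <- cor(train[, names(train) != "classe"])
corrplot(cor_mat, method = "color", type = "lower",
         tl.cex = 0.5, tl.col = "black")
```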
As you can see, there is definitely some correlation among the variables that will be fed into our model. For instance, the acceleration of the belt in the z direction is correlated with the total acceleration of the belt, the "roll" of the belt, and the movement of the belt in other directions. Likewise, the movement of the forearm is correlated with the movement of the dumbbell, as would be expected.
Creating our Model
Finally, let's create our model. I chose the random forest algorithm mainly due to its high accuracy when modeling non-linear relationships. Additionally, I used 5-fold cross-validation by passing the `method = "cv", 5` arguments to the `trainControl` function:
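In outline, with everything besides the cross-validation setup left at caret's defaults:

```r
ctrl <- trainControl(method = "cv", 5)  # 5-fold cross-validation
rf_model <- train(classe ~ ., data = train, method = "rf", trControl = ctrl)
```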
Testing Prediction Accuracy on Validation Set
Let's now test our model on our validation set to see how accurate the predictions are:
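Roughly:

```r
val_pred <- predict(rf_model, validation)
confusionMatrix(val_pred, validation$classe)
```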
As you can see, we have a prediction accuracy of 1 (100%): every single prediction on the validation set was correct. In other words, our expected out-of-sample error is 0%. This could mean that our model is over-fitted; however, considering that the objective of this project is to predict values drawn from a held-out subset of the same data we trained on, I would say our main goal of model accuracy is certainly achieved in this context.
Test Set Final Predictions
Finally, predicting the 20 `classe` values of the test set:
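In code:

```r
test_pred <- predict(rf_model, test)
test_pred  # the 20 predicted classe values (not shown here)
```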
The 20 test-set predictions are part of the graded quiz portion of this assignment on Coursera, and thus are not shown here. However, you'll just have to take my word that every prediction is indeed correct!