Getting and Cleaning Data Course Project

Date
September 13, 2023
Tags
R

Motive/Background

The following is my submission for the Getting and Cleaning Data course project under Johns Hopkins University’s Data Science Specialization. This assignment challenges students to work with large, messy plain-text (.txt) files of accelerometer data from Samsung Galaxy S smartphones recording physical activity (similar to the data produced by wearable devices like the Fitbit). Broadly speaking, the main goal of the assignment is for students to demonstrate their ability to retrieve, transform, and clean data into a tidy, workable format for downstream analysis.

Data

The dataset and tables used have been sourced from the following publication:

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012

As per the source:

"The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.”

Additionally, for each record/row, we are provided with:

  • An identifier of the subject who carried out the experiment
  • The activity label
  • Triaxial acceleration from the accelerometer and the estimated body acceleration
  • Triaxial angular velocity from the gyroscope
  • A 561-feature vector with time and frequency domain variables

Instructions

There are 5 main steps/tasks to complete:

"You should create one R script called run_analysis.R that does the following:
  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject."

Writeup + Code Book

Click here to view the original GitHub code book file for this project, which includes the complete R script used. The rest of this document summarizes and restates what is already on that page.

Transformation & Cleaning Process

1) Merging Training and Test Sets to Create One Data Set

First, our libraries are loaded and the data is read into the RStudio environment using the read.table() function. Each table is loaded into its own dataframe named after its corresponding plain-text file. For this assignment we are only concerned with the subject, X, and y tables of both the test and train sets (6 tables total).

library(tidyverse)
library(magrittr)

# Read in the three test-set tables
setwd("~/Desktop/Data/Data Science Specialization Course/Getting and Cleaning Data Course/Week 4 Project Assignment/UCI HAR Dataset/test")
subject_test <- read.table("subject_test.txt")
x_test <- read.table("X_test.txt")
y_test <- read.table("y_test.txt")

# Read in the three training-set tables
setwd("~/Desktop/Data/Data Science Specialization Course/Getting and Cleaning Data Course/Week 4 Project Assignment/UCI HAR Dataset/train")
subject_train <- read.table("subject_train.txt")
x_train <- read.table("X_train.txt")
y_train <- read.table("y_train.txt")
Loading in packages + data
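
As a side note, the pipeline can be made more reproducible by fetching the archive in code rather than pointing setwd() at a local folder. A minimal sketch (the URL below is the course's hosted copy of the dataset and is an assumption that may go stale):

# Download and extract the archive into the working directory
# (URL assumed from the course materials; swap in your own mirror if needed)
zip_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
if (!file.exists("UCI_HAR.zip")) download.file(zip_url, "UCI_HAR.zip", mode = "wb")
unzip("UCI_HAR.zip")   # creates a "UCI HAR Dataset" folder
Downloading the dataset programmatically (optional alternative)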

Now that our tables are loaded, we can begin combining them (strictly speaking, we are column-binding and row-binding rather than merging).

Before we do that, we will first standardize the subject and y table column names for both sets. Since the original files have no header row, read.table() assigns placeholder column names (V1, V2, V3, ...) by default.

Because we will be column-binding the tables together, there would otherwise be multiple "V1" columns across the tables, so this step heads off any confusion from the start.

subject_test %<>% rename(SubjectID = V1)
y_test %<>% rename(Activity = V1)

subject_train %<>% rename(SubjectID = V1)
y_train %<>% rename(Activity = V1)
Renaming column names for the subject and labels (y) tables
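
For reference, the same renaming can be done in base R without magrittr's compound assignment pipe; a minimal equivalent sketch:

# Base-R equivalent of the rename() calls above
names(subject_test)[1] <- "SubjectID"
names(y_test)[1] <- "Activity"
names(subject_train)[1] <- "SubjectID"
names(y_train)[1] <- "Activity"
Base-R alternative for the renaming step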

Now our tables are ready to be combined. First, all of the test tables are column-bound into one aggregate test dataframe (test_df). Likewise, all of the train tables are column-bound into their own aggregate train dataframe (train_df). Finally, train_df and test_df are row-bound, resulting in one final dataframe of all the relevant data, named data.

Additionally, I clear the environment of every table no longer in use to free up some memory.

test_df <- cbind(subject_test, y_test, x_test)
train_df <- cbind(subject_train, y_train, x_train)
data <- rbind(train_df, test_df)
rm(test_df, train_df, subject_train, subject_test, x_test, x_train, y_test, y_train)
Column-binding and row-binding all tables together
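
To confirm the binding worked as intended, a quick sanity check on the dimensions doesn't hurt. The training set contributes 7,352 rows and the test set 2,947, and each row should carry the SubjectID, the Activity, and the 561 features:

# Sanity check: 10299 rows (7352 train + 2947 test), 563 columns (2 + 561)
stopifnot(nrow(data) == 10299, ncol(data) == 563)
Verifying the dimensions of the combined data set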

2) Extracting Only the Mean and Standard Deviation Measurements for Each Variable

As it currently stands, the new data table is a dataframe of 10299 observations and 563 variables. The first column is the SubjectID, the second is the Activity, and the remaining columns are the 561-feature vector of time and frequency domain variables. Of these 561 columns, only some contain "mean()" and "std()" measurements of certain variables. This step of the assignment requires us to narrow the variables/columns down to only those that represent a mean or standard deviation value.

By looking at the features.txt file provided in the original dataset folder, we can see every column name of the 561-feature vector. From there, I noted which ones contain "mean()" or "std()" and selected each column by its number. This leaves us with 68 total variables in the data set (SubjectID, Activity, and 66 measurement columns).

data %<>% select(SubjectID, Activity, V1:V6, V41:V46, V81:V86, V121:V126, 
                 V161:V166, V201, V202, V214, V215, V227, V228, V240, V241,
                 V253, V254, V266:V271, V345:V350, V424:V429, V503, V504, 
                 V516, V517, V529, V530, V542, V543)
Extracting only the relevant columns that have mean and standard deviation measurements
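
As an aside, this selection can also be derived programmatically instead of noting the 66 column numbers by hand. A sketch, assuming features.txt has already been read into a dataframe named features as is done in Step 4 below:

# Find the indices of every feature whose name contains "mean()" or "std()"
ms_idx <- grep("mean\\(\\)|std\\(\\)", features$V2)   # 66 matches
# The feature columns still carry their default V-names, so index into them
data %<>% select(SubjectID, Activity, all_of(paste0("V", ms_idx)))
A programmatic alternative for selecting the mean/std columns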

3) Using Descriptive Activity Names to Name the Activities in the Data Set

Up until this point, the Activity column contains numeric values ranging from 1 to 6, each corresponding to one of the six activities performed by the subject. From the activity_labels.txt file provided in the original dataset folder, we can see which numeric value corresponds to which activity. From there, we can use the mutate() function along with the case_when() function to replace these values in the data set.

data %<>% mutate(Activity = case_when(Activity == 1 ~ "Walking",
                                      Activity == 2 ~ "Walking Upstairs",
                                      Activity == 3 ~ "Walking Downstairs",
                                      Activity == 4 ~ "Sitting",
                                      Activity == 5 ~ "Standing",
                                      Activity == 6 ~ "Laying"))
Replacing numeric activity labels with their corresponding qualitative descriptions
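
An equivalent approach (a sketch) is to convert Activity into a factor whose levels 1 through 6 map directly onto the labels from activity_labels.txt:

# Map the numeric codes 1-6 onto descriptive labels via a factor
data %<>% mutate(Activity = factor(Activity, levels = 1:6,
                                   labels = c("Walking", "Walking Upstairs",
                                              "Walking Downstairs", "Sitting",
                                              "Standing", "Laying")))
A factor-based alternative for labeling activities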

4) Appropriately Labeling the Data Set with Descriptive Variable Names

Since the column names of the 561-feature vector are still the defaults at this point (V1, V2, V3, etc.), we will now populate the column names with the descriptive variable names found in the features.txt file. This time, we will actually load the features.txt file into its own dataframe, features. From there, we will select the names of the same relevant columns from Step 2 and save them to a vector named variable_names.

setwd("~/Desktop/Data/Data Science Specialization Course/Getting and Cleaning Data Course/Week 4 Project Assignment/UCI HAR Dataset")
features <- read.table("features.txt")
variable_names <- features[c(1:6,41:46,81:86,121:126,161:166,201,202,214,215,227,228,240,
                             241,253,254,266:271,345:350,424:429,503,504,516,517,529,530,542,543), 2]
Reading in the "features.txt" file to save all the variable names to a vector

We now have a vector of the corresponding variable names for the exact columns that we are working with in the data dataset.
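
As a quick guard (a sketch), we can assert that exactly 66 names were pulled out, which will become 68 columns once SubjectID and Activity are added back:

# There should be exactly 66 selected feature names
stopifnot(length(variable_names) == 66)
Sanity-checking the extracted variable names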

These names, however, are not entirely tidy and contain several ambiguities and special characters (e.g., "tBodyAcc-mean()-X"). To fix this, we will clean the variable names using str_replace() and regular expressions. Specifically, we will replace the "t" prefixes with "Time", replace the "f" prefixes with "Freq", replace "mean()" with "Mean" and "std()" with "Std", and remove all dashes. For example, "tBodyAcc-mean()-X" becomes "TimeBodyAccMeanX".

variable_names %<>% str_replace(pattern = "^t", replacement = "Time") %>%
                    str_replace(pattern = "^f", replacement = "Freq") %>%
                    str_replace(pattern = "mean\\(\\)", replacement = "Mean") %>%
                    str_replace(pattern = "std\\(\\)", replacement = "Std") %>%
                    str_replace_all(pattern = "\\-", replacement = "")
Cleaning and appropriately naming variable names

Now that our variable_names vector has the final version of all the column names, we can use it to rename the original columns in our data dataset.

colnames(data) <- c("SubjectID", "Activity", variable_names)
Replacing default variable names with corresponding descriptive variable names

As a final (optional) step to leave the data in its most organized and tidy form, we will sort the dataset by SubjectID and Activity.

data %<>% arrange(SubjectID, Activity)
Sorting by SubjectID and Activity

Done! Our dataset is now clean and tidy, ready for any downstream analysis someone may want to use it for.

5) (From the Data Set in Step 4) Creating a Second, Independent Tidy Data Set With the Average of Each Variable for Each Activity and Each Subject

The final step is fairly self-explanatory from its title. We can accomplish it by grouping by SubjectID and Activity and using summarise_all() to apply a function (in this case, mean) across all of the remaining columns. We assign the resulting dataframe to an object named averages. The final product is a dataframe of 180 observations (30 subjects × 6 activities) and 68 variables.

averages <- data %>% group_by(SubjectID, Activity) %>% summarise_all(mean)
Creating the final averages data set
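
For what it's worth, summarise_all() has since been superseded in newer dplyr releases; an equivalent sketch using across():

# Modern dplyr equivalent: average every non-grouping column
averages <- data %>%
  group_by(SubjectID, Activity) %>%
  summarise(across(everything(), mean), .groups = "drop")
A modern across() equivalent of summarise_all()

If a file needs to be submitted, write.table(averages, "averages.txt", row.names = FALSE) will write the result out as plain text.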

Variable Descriptions & Naming Convention/Schema