# Motive/Background

The following is my submission for the Regression Models course project under the Johns Hopkins University’s Data Science Specialization. The prompt for this project is as follows:

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

- “Is an automatic or manual transmission better for MPG?”
- “Quantify the MPG difference between automatic and manual transmissions”

In this report, **we find that there is some evidence to suggest that manual transmissions get better mileage than automatics** within our limited dataset. Additionally, **we found that a car having a manual transmission results in an increase of ~1.48 miles per gallon (all else equal) in our linear model**. With that being said, a much more thorough analysis (with larger sample sizes) is necessary to truly conclude such results with confidence.

# Writeup

**Click here** to view the full RPubs writeup/report for this project with all of the included code. The rest of this document will summarize/restate what is already on that page.

# The Data

The `mtcars` dataset (Motor Trend Car Road Tests) from R’s `datasets` package was used for this analysis. Let’s begin by loading in some helpful/necessary packages, as well as our dataset:

Previewing the first few rows of the dataset:
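The original code chunks live on the RPubs page and are not reproduced here. As a minimal sketch of the loading and previewing steps, assuming only base R (the report itself also relies on plotting and modeling packages such as `ggplot2` and `broom`):

```r
# mtcars ships with base R in the 'datasets' package
data(mtcars)

# Preview the first few rows
head(mtcars)

# Check the dimensions: 32 cars, 11 variables
dim(mtcars)
```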

Now that we have a preview of our data, let’s break down what each variable represents:

- **mpg**: Miles/gallon
- **cyl**: Number of cylinders
- **disp**: Engine displacement (cubic inches)
- **hp**: Gross horsepower
- **drat**: Rear axle ratio
- **wt**: Weight (1000 lbs)
- **qsec**: Quarter mile time
- **vs**: Cylinder arrangement/configuration (0 = V-Shaped, 1 = Straight)
- **am**: Transmission (0 = Automatic, 1 = Manual)
- **gear**: Number of forward gears
- **carb**: Number of carburetors

**Preliminary Brainstorming: Thinking About our Variables in Relation to Modeling**

When creating a model with several variables/regressors, the general idea is to drop variables that have little relationship with the response variable, while keeping the ones presumed to be much more strongly related to it. In this case, we will eventually need a model that takes into account the variables which contribute the most to MPG (as well as our transmission variable). With that being said, we also want to avoid collinearity issues in our model, in which our predictors/regressors are too highly correlated with one another.

Before we even begin an exploratory analysis, let’s first try to understand which key variables will be of the most interest to us, and then focus on those. I will now go one by one and explain my reasoning and thought process about each variable and how it relates to our model. Some of these variables will be much more obvious to include/exclude than others, so for now we will only briefly explain each variable. We will decide which ones to keep when experimenting with model design later in the report.

- __cyl__: In general, the number of cylinders that a vehicle has is typically correlated with the miles per gallon that you can expect to achieve. In each cylinder, there must be a certain air-fuel ratio present in order for combustion to occur as intended. More cylinders generally means that more fuel is necessary to “power” the entire engine (since each cylinder needs its own share of fuel), which ultimately affects MPG. While this has changed in recent years with modern engines (largely due to the help of forced induction), this general trend seems to hold true more often than not.
- __disp__: Engine displacement can be viewed as a measure of the total volume being displaced by the stroke of each piston in its cylinder. The larger the displacement, the “larger” the engine is (either due to larger cylinders/pistons, more cylinders, or both). Regardless of which component is leading to a larger displacement, the larger the engine is, the more fuel is typically required to run it. This variable may be highly correlated with our other variables of interest (most notably *cyl*), so we may or may not include it in our model.
- __hp__: Being a unit of power, horsepower measures the rate at which an engine does work at the crankshaft (rather than a force). The amount of power an engine makes depends on three things: air, fuel, and ignition/spark. While ignition is the least important in regards to making more power (you just need one spark, not “more” spark), the more fuel *and* air that can be combusted in each cylinder, the more power the engine will make. As a result, an engine that makes more power (all else equal) will necessarily need a higher volume of fuel and air. (Once again, this has also changed with recent technology and advancements, but this general trend holds true much more often than not.)
- __drat__: While higher rear axle gear ratios can negatively impact MPG, most of this concern revolves around trucks and large vehicles which tow heavy items. Here, a higher gear ratio will grant more torque to the wheels (and thus allow easier and more efficient towing) at the cost of fuel efficiency. With that being said, for the vast majority of “regular” cars (including the ones in our dataset), this concern is trivial and mostly irrelevant since the gear ratios have already been optimized from the factory. We will most likely omit this variable from our model.
- __wt__: Weight is definitely one of the largest factors that comes into play in regards to MPG. Simply put, the heavier a car is, the more force, power, and torque is necessary to move it. All of these require an engine to use more fuel compared to a lighter car, all else equal.
- __qsec__: A car’s quarter mile time is the time it takes to travel a quarter mile from a complete stop (full acceleration). This too shares a relationship with some of the other variables that we have examined thus far, namely horsepower. A car’s ability to cover a certain distance in a certain amount of time is largely correlated with how much power it can produce (how “fast” it is). Because of this, we can assume that more fuel is required to make the extra power needed to cover that distance faster. We may or may not keep this variable in our model, depending on how collinear it is with our other variables (as measured by the variance inflation factor).
- __vs__: While cylinder arrangement is important for engine design, it is not as strongly related to MPG. The number of cylinders would be much more relevant here, not necessarily the way they are arranged (the location of the cylinders doesn’t really change how much air/fuel needs to be compressed). We will most likely omit this variable from our model.
- __am__: Since our main question revolves around whether or not a manual transmission is more fuel efficient, we will obviously need to include this variable/regressor regardless. As far as I know, there used to be evidence to suggest that manual transmissions got better fuel economy than their automatic counterparts. Once again, this has largely changed in the past few decades, and it now stands as an automotive myth in the context of modern vehicles. With that being said, since our dataset contains older cars, we may very well witness a positive correlation with MPG for this variable.
- __gear__: While the number of forward gears a car is equipped with also has an effect on mileage, compared to our other (more significant) variables, it does not hold as much weight in that regard. We will most likely omit this variable from our model.
- __carb__: The number of carburetors a car has can also affect MPG, but once again, it pales in significance to some of our other more important variables. We will most likely omit this variable from our model as well.

# Exploratory Analysis

Now that we have a general idea of which variables we deem important/significant, we can examine them individually to see what our data says about them. An easy and intuitive way to visualize these distributions is through boxplots.

**Examining Our Main Regressor: Transmission Type**

Before we look at the other variables of interest, let’s start by analyzing our main variable of interest as it pertains to MPG. Let’s create a boxplot of MPG by transmission type:
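A sketch of how such a boxplot could be built with `ggplot2`. The object names `mtcars_am` and `am_plot`, the labels, and the styling are my own assumptions, not necessarily what the original report used:

```r
library(ggplot2)

# Label the 0/1 transmission codes so the x-axis is readable
mtcars_am <- transform(
  mtcars,
  transmission = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# One box per transmission type, with MPG on the y-axis
am_plot <- ggplot(mtcars_am, aes(x = transmission, y = mpg, fill = transmission)) +
  geom_boxplot() +
  labs(x = "Transmission type", y = "Miles per gallon",
       title = "MPG by transmission type") +
  theme(legend.position = "none")

am_plot
```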

As you can see, there definitely seems to be a relationship between transmission type and MPG. More specifically, it seems as though manual transmissions get better fuel economy than automatics.

While we’re here, let’s go ahead and run a hypothesis test to see whether or not there actually is a statistically significant difference in mean MPG between the transmission types. Running a t-test at the 95% confidence level:
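A minimal sketch of this test in base R, assuming the default Welch two-sample t-test (the exact options used in the original report are not shown here):

```r
# Welch two-sample t-test (R's default: equal variances are not assumed)
am_test <- t.test(mpg ~ am, data = mtcars, conf.level = 0.95)

am_test$estimate   # group means for am = 0 (automatic) and am = 1 (manual)
am_test$p.value    # compare against 0.05
am_test$conf.int   # 95% CI for (automatic mean - manual mean)
```

Because the formula puts automatic (`am = 0`) first, a negative confidence interval corresponds to manual cars having the higher mean MPG.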

As you can see, we get a p-value of 0.001 (<0.05), which means we reject the null hypothesis that the mean MPG for automatic and manual transmissions is the same. In other words, the difference in means between these groups is statistically significant. However, this is only a simple t-test based on a limited dataset (and only examining one variable), so let’s continue onwards with our analysis before we conclude anything about this result.

**Examining Other Significant Regressors**

Now let’s examine the rest of the variables that may play a significant role in predicting MPG for our model. As stated previously in the brainstorming process, there are several variables I believe will be strongly correlated with MPG (and perhaps with the other predictors as well). We will only focus on cylinders, horsepower, displacement, and weight for the following boxplot visualizations.

Before creating the plots, I created a new dataframe named `data`, which is essentially just a copy of our `mtcars` dataset, but with three new “bin” columns added to it. Since the *hp*, *disp*, and *wt* variables contain a relatively wide range of continuous values, I wanted to split each variable into three different bins, so that each variable would only be plotted over three ranges (otherwise, there would simply be way too many boxes):
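One way the binning could look in base R. The column names (`hp_bin`, `disp_bin`, `wt_bin`), the labels, and the use of equal-width intervals are assumptions; the original report may have chosen different cut points:

```r
# Copy the dataset and add three-level "bin" columns for the continuous variables
data <- mtcars

# cut(..., breaks = 3) splits each variable's range into three equal-width intervals
bin_labels <- c("Low", "Medium", "High")
data$hp_bin   <- cut(data$hp,   breaks = 3, labels = bin_labels)
data$disp_bin <- cut(data$disp, breaks = 3, labels = bin_labels)
data$wt_bin   <- cut(data$wt,   breaks = 3, labels = bin_labels)

# How many cars fall into each horsepower bin
table(data$hp_bin)
```

Quantile-based bins (for example via `quantile()`) would give more evenly sized groups, at the cost of less intuitive cut points.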

Now that we have these new columns created, we can feed them into our ggplot functions to properly plot them using the bins. Creating the cylinder plot and assigning it to an object:

Creating the horsepower plot and assigning it to an object:

Creating the displacement plot and assigning it to an object:

Creating the weight plot and assigning it to an object:

Now finally plotting all the plots together:
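The four plotting steps above could be sketched as follows. The helper `mpg_boxplot()` is hypothetical, the bins are the equal-width ones assumed earlier, and `gridExtra::grid.arrange()` is only one of several ways to place plots side by side (the original report may have used another):

```r
library(ggplot2)
library(gridExtra)  # assumption: used here only to arrange the four plots

# Rebuild the binned copy of mtcars (equal-width bins are an assumption)
data <- mtcars
bin3 <- function(x) cut(x, breaks = 3, labels = c("Low", "Medium", "High"))
data$hp_bin   <- bin3(data$hp)
data$disp_bin <- bin3(data$disp)
data$wt_bin   <- bin3(data$wt)

# Hypothetical helper: one MPG boxplot per level of the given grouping column
mpg_boxplot <- function(df, group_col, x_label) {
  ggplot(df, aes(x = factor(.data[[group_col]]), y = mpg)) +
    geom_boxplot() +
    labs(x = x_label, y = "MPG")
}

cyl_plot  <- mpg_boxplot(data, "cyl",      "Cylinders")
hp_plot   <- mpg_boxplot(data, "hp_bin",   "Horsepower")
disp_plot <- mpg_boxplot(data, "disp_bin", "Displacement")
wt_plot   <- mpg_boxplot(data, "wt_bin",   "Weight")

# Arrange all four plots in a 2 x 2 grid
grid.arrange(cyl_plot, hp_plot, disp_plot, wt_plot, ncol = 2)
```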

As you can see, there are definitely some observable trends with these variables in relation to MPG. An increase in cylinders seems to be correlated with a lower MPG (more cylinders = more fuel needed for combustion to occur throughout the engine) as predicted earlier. Additionally, higher horsepower, displacement, and weight all seem to share this negative correlation with MPG as well.

Let’s now move on to creating our model so that we can gain an even clearer understanding of the relationship that these variables have with MPG.

**Model Selection & Creation**

**Model 1: Including Only Transmission Type as a Predictor**

To begin, let’s create a simple linear model using *mpg* as the outcome, and *am* (transmission) as the predictor.

Let’s check out a summary of this model:
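A minimal sketch of fitting and summarizing this model in base R (the object name `model1` is an assumption):

```r
# Simple linear regression: MPG explained by transmission type only
model1 <- lm(mpg ~ am, data = mtcars)

summary(model1)             # coefficients, p-values, R-squared
coef(model1)                # intercept and the am coefficient
summary(model1)$r.squared   # share of MPG variation explained

# With a single 0/1 predictor, the slope is just the difference in group means
diff(tapply(mtcars$mpg, mtcars$am, mean))
```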

Right off the bat, we see that we have an R-squared value of ~0.36: only around 36% of the variation in MPG is explained by the transmission type. Of course, this is a pretty low R-squared value that we would ideally like to see much higher in the context of our data. This was completely expected, however, since we are only examining a single variable in this model. Additionally, our intercept is ~17.1 and our regressor coefficient is ~7.2. This means that under this model, a car with a “transmission value of zero” (automatic transmission) has a predicted value of ~17.1 MPG, which is simply the mean MPG of the automatic cars in the dataset. With a one-unit “increase in transmission” (going from automatic to manual), you can expect predicted MPG to increase by ~7.2.

**Model 2: Including Every Variable as a Predictor**

Now let’s do the opposite of the last model and include every variable in our model instead:
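A sketch of the full model, using R’s `.` shorthand for “every other column” (the object name `model2` is an assumption):

```r
# Regress MPG on all ten remaining variables in mtcars
model2 <- lm(mpg ~ ., data = mtcars)

summary(model2)             # no coefficient is starred at the 5% level
summary(model2)$r.squared   # roughly 0.87
```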

We now have a much larger R-squared value; ~87% of the variation in MPG can be explained by all of the variables included in our dataset. It is important to note, however, that we are NOT simply chasing after higher R-squared values (relying simply on R-squared can be severely misleading or irrelevant in many contexts). Many of these variables may very well be redundant or insignificant for our model, as we will soon see.

Notice also that when we include every single variable, none of them are statistically significant at the 5% significance level (no variable has a “*” marking next to it in the summary call).

Another concept that we will now examine closer is the problem of collinearity. From an article written by Zack on statology.org:

Multicollinearity in regression analysis occurs when two or more predictor variables are highly correlated to each other, such that they do not provide unique or independent information in the regression model. If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the regression model.

An important tool to examine how much collinearity certain variables have with each other is the variance inflation factor (VIF). The VIF takes values in [1, ∞); the higher the number, the more collinearity is present between that variable and the other predictors. Generally, values from 1-5 are considered an acceptable/moderate range of correlation, while values larger than that (some people say 10+) are typically considered to exhibit high levels of correlation.

Let’s check out the VIF for this model:
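In practice this is commonly done with `car::vif()`, though I am only assuming that is what the original report used. To make the definition concrete, here is a base-R sketch that computes the same quantity by hand: for each predictor j, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all of the other predictors. The helper name `manual_vif()` is hypothetical:

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
manual_vif <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  vapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

model2 <- lm(mpg ~ ., data = mtcars)
vifs <- manual_vif(model2)

# Largest VIFs first
round(sort(vifs, decreasing = TRUE), 2)
```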

As you can see, we have several variables with very high VIFs, most notably the *disp* variable with a VIF of ~21.6. Clearly, it is not enough to simply stuff every single variable into our model and call it a day.

Let’s now create and fine-tune a new model, this time by omitting less important variables.

**Model 3: Removing Less Important Variables**

Let’s now create a model without the *carb*, *gear*, *qsec*, *vs*, and *drat* variables (the ones I hypothesized to be of least importance earlier):
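A sketch of this reduced model, keeping the five remaining predictors (the object name `model3` is an assumption):

```r
# Drop carb, gear, qsec, vs, and drat; keep cyl, disp, hp, wt, and am
model3 <- lm(mpg ~ cyl + disp + hp + wt + am, data = mtcars)

summary(model3)             # check which coefficients are now starred
summary(model3)$r.squared   # compare against the full model's R-squared
```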

We can now already see more statistical significance in some of our variables, namely *wt* at the 5% significance level and *hp* at the 10% significance level. Additionally, our R-squared value decreased slightly (.869 to .855); however, as discussed previously, this is nothing to be concerned about, especially when we are making our model much more relevant and parsimonious.

Let’s now check our VIF values and see what has changed since the last model:

As you can see, the VIF values are much lower now, which is certainly a good sign. With that being said, *disp* is still ~10.4 and *cyl* is ~7.2. Let’s see if we can get even better results by omitting *disp* from our model, since this seems to be highly correlated with our other variables (most likely with *cyl*, since the more cylinders a car has, the more volume it typically displaces).

**Final Model: Removing *disp***

Let’s now see what we get when we remove the *disp* variable from our model:

And checking the VIF:
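A sketch of the final model and its VIF check in base R. The object names are assumptions, and the VIFs are computed here from the inverse of the predictors’ correlation matrix, whose diagonal equals 1 / (1 - R_j²) for each predictor; this is not necessarily the function the original report called:

```r
# Final model: cylinders, horsepower, weight, and transmission type
final_model <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

round(coef(summary(final_model)), 4)   # estimates, standard errors, p-values
summary(final_model)$r.squared

# VIFs: diagonal of the inverse correlation matrix of the predictors
predictors <- mtcars[, c("cyl", "hp", "wt", "am")]
final_vifs <- diag(solve(cor(predictors)))
round(final_vifs, 2)
```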

Great! Our VIFs are now at acceptable levels in the context of our question/model. Additionally, our R-squared is still at an acceptable value of ~0.85. Taking this into consideration, I am pleased with using this version as our final model.

In summary, we have chosen four variables to examine their effect on a car’s MPG in this model: the cylinder count, amount of horsepower, the weight, and transmission type. **With an intercept of ~36.1, this means that a car that has 0 cylinders, 0 horsepower, weighs 0 lbs, and has an automatic transmission (am = 0) will have a predicted mean MPG of ~36.1**. Of course, this scenario is impossible, so we don’t need to place too much importance on the interpretation of this value in this context (indeed, this is oftentimes the case depending on what model you have or what problem you are trying to solve).

As for our regressor coefficients:

- **A 1-unit increase in cylinders (all else equal) equates to a ~0.75 decrease in MPG**.
- **A 1-unit increase in horsepower (all else equal) equates to a ~0.02 decrease in MPG**. This variable is statistically significant at the 10% significance level.
- **A 1000 lb increase in weight (all else equal) equates to a ~2.6 decrease in MPG**. This variable is statistically significant at the 1% significance level.
- **A car having a manual transmission (all else equal) equates to a ~1.48 *increase* in MPG**. This variable, however, is **not** statistically significant.

**Plotting Residuals**

Now, let’s get a visual representation of our residuals and how they’re spread out. To do this, we will create a scatterplot which plots the residual values against the fitted values of the model. Ideally, we are looking for residuals that are spread relatively evenly around the horizontal reference line (the dotted line at y = 0). Any drastic variation/spread in one direction or another (or spreads that display a certain pattern) typically indicates that the given model is inappropriate for modeling the given data.

First, we will use the `augment()` function from the `broom` library to transform our final model into a workable dataframe, so that we can easily plot certain features in ggplot:
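A sketch of this step, assuming the final model is `mpg ~ cyl + hp + wt + am` as described above. `augment()` returns one row per observation with added columns such as `.fitted` and `.resid`; the object names and plot styling are my own:

```r
library(ggplot2)
library(broom)

final_model <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

# One row per car, with fitted values (.fitted) and residuals (.resid) attached
model_df <- augment(final_model)

# Residuals vs. fitted values, with a dashed reference line at y = 0
resid_plot <- ggplot(model_df, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values (MPG)", y = "Residuals",
       title = "Residuals vs. fitted values")

resid_plot
```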

As you can see, there definitely *is* some variation/skew around the y = 0 line (which is inevitable), but it does not seem drastic or significant in this regard. It appears as though there *may* be some signs of heteroskedasticity towards the higher fitted values, but once again, not to any considerable extent.

**Q-Q Plot**

As one of our last steps, we will now create a Q-Q plot (quantile-quantile plot), which compares two probability distributions by plotting the sample quantiles against the theoretical quantiles (here, those of a normal distribution). Ideally, we would like to see our data points line up along the straight, diagonal reference line as much as possible. Deviations from this line can indicate departures from the assumed distribution, providing a visual cue for assessing normality or other distributional assumptions.

Since I could not find a built-in ggplot method to calculate the parameters of the Q-Q line (newer versions of ggplot2 provide `stat_qq_line()` for this), I had to find a function that takes the linear model as an argument and returns the desired results as a ggplot. Credit goes to Peter and Aaron from this Stack Overflow post for the function that I am using (with tweaks):
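The Stack Overflow function itself is not reproduced here. As an illustration of the general idea only, the hypothetical helper `gg_qq()` below draws the reference line through the first and third quartiles, which is the same construction base R’s `qqline()` uses by default:

```r
library(ggplot2)

# Hypothetical helper: normal Q-Q plot of a model's residuals with a reference line
gg_qq <- function(model) {
  res <- resid(model)

  # Line through the first and third quartiles of sample vs. theoretical quantiles
  y <- quantile(res, c(0.25, 0.75), names = FALSE)
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y) / diff(x)
  intercept <- y[1] - slope * x[1]

  ggplot(data.frame(res = res), aes(sample = res)) +
    stat_qq() +
    geom_abline(slope = slope, intercept = intercept, linetype = "dashed") +
    labs(x = "Theoretical quantiles", y = "Sample quantiles",
         title = "Normal Q-Q plot of residuals")
}

final_model <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)
qq_plot <- gg_qq(final_model)
qq_plot
```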

As you can see, most of our values fall fairly close to the reference line, with some variation towards the larger quantiles. This is to be expected, especially considering our data and the small sample size that we are working with.

**Conclusion/Answers**

With our final model created and set, let’s now answer the questions we set out to analyze.

**Is an automatic or manual transmission better for MPG?**

**By only examining the transmission variable against MPG using a t-test**, we have found **statistically significant evidence (at the 5% significance level) that the mean MPG for automatic and manual transmissions is not equal**.

With that being said, **within our linear model**, which predicts MPG from the number of cylinders, horsepower, weight, and transmission type of a car, **we found that the transmission variable is *not* statistically significant**. **In other words, there are far more significant variables that come into play when determining the number of miles per gallon a car can achieve given our model**. Still, our transmission regressor coefficient tells us that there is indeed an increase in MPG given a manual transmission (all other variables held equal).

Overall, there is some evidence to suggest that manual transmissions get better mileage than automatics within our limited dataset; however, much more thorough analyses must be done to truly conclude such results with confidence.

**Quantify the MPG difference between automatic and manual transmissions**

Within the context of our final linear model, **we found that a car having a manual transmission results in an increase of ~1.48 miles per gallon (all else equal)**.