Motive/Background
This NLP (Natural Language Processing) text prediction app was built as part of the final capstone project for the ten-course Data Science Specialization offered by Johns Hopkins University. As the final project, students were challenged to delve into an unfamiliar area of data science that had not been taught in any of the prior courses. As stated in the instructions for the assignment on Coursera:
"You will use all of the skills you have learned during the Data Science Specialization in this course, but you'll notice that we are tackling a brand new application: analysis of text data and natural language processing. This choice is on purpose. As a practicing data scientist you will be frequently confronted with new data types and problems. A big part of the fun and challenge of being a data scientist is figuring out how to work with these new data types to build data products people love."
Overall, I had a great (and challenging) time diving into the world of NLP and learning the nuances, key challenges, and steps involved in creating a working text processing algorithm. All in all, it took me around three weeks of on-and-off work to complete this project, starting from zero baseline knowledge of NLP.
App
Click the link below to access and try out the app yourself!
Additionally, there is an entire “App Info” section within the app that goes into more detail as to how the algorithm works.
Preview
Below is a screenshot preview of the app for convenience:
Code
Click here to view the GitHub repository for the project. Descriptions of each file are listed in the linked “README” file.
Milestone Report: Exploratory Analysis of the Text Data
About halfway through the project, students were also instructed to create a report analyzing some of the key features of the provided text data.
Summary
The following is a milestone report summarizing the exploratory analysis and the steps taken thus far to prepare for building an NLP text prediction model/app. As part of the capstone project for the Data Science Specialization offered by Johns Hopkins University, students are instructed to use web-scraped data provided by SwiftKey to build a Shiny app that predicts the next word of a sentence given a few words of input (similar to the modern-day predictive keyboard features found on iPhones, for example).
Loading/reading in Data
The first step of our analysis begins with retrieving the given datasets and loading them into our R environment. As stated previously, the datasets are provided courtesy of SwiftKey, who have scraped web data from three sources: blog sites, news sites, and Twitter. The data from each source is contained in its own corresponding .txt file. Furthermore, we will only be working with the “en_US” locale datasets for this project. The data is provided to students via a download link; in order to preserve reproducibility in our analysis, we will download the dataset using this link directly within our R script.
From there, we will assign each dataset to its own corresponding object/character vector using readLines().
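Below is a minimal sketch of this step. The zip URL, file paths, and object names shown here are my assumptions about the standard SwiftKey download rather than the exact final script.

```r
# Download and unzip the SwiftKey dataset (URL/paths assumed from the course materials)
zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zipUrl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}

# Read each en_US file into its own character vector (one element per line)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```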
Let's now run some quick summary statistics on each dataset.
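As a sketch of how these summaries could be computed (using the blogs, news, and twitter vectors from above; the stringi package is my own choice here for fast word counts):

```r
library(stringi)

# Total lines, total words, and average words per line for a vector of text lines
summarize_text <- function(lines) {
  wordsPerLine <- stri_count_words(lines)
  c(
    totalLines   = length(lines),
    totalWords   = sum(wordsPerLine),
    avgWordsLine = round(mean(wordsPerLine), 1)
  )
}

summarize_text(blogs)
summarize_text(news)
summarize_text(twitter)
```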
Blogs Dataset
As you can see, the blogs dataset contains 899,288 total lines, 37,546,250 total words, and an average of ~41.8 words per line.
News Dataset
The news dataset contains 1,010,242 total lines, 34,762,395 total words, and an average of ~34.4 words per line.
Twitter Dataset
Lastly, the Twitter dataset contains 2,360,148 total lines, 30,093,372 total words, and an average of ~12.8 words per line. Notice the discrepancy in the average words per line in the Twitter dataset compared to the blogs and news datasets; this is largely due to the 140-character limit imposed on tweets (at the time the data was collected).
Data Processing
Since the datasets each contain between roughly 900,000 and 2.4 million lines, we will randomly sample 100,000 lines from each when performing our exploratory analysis. The main reason for this is simply to reduce the loading/processing times of some of the tokenizing functions that we will be using later (especially when creating bigram and trigram tokens, which increase the dataset sizes substantially). By randomly sampling a large number of lines from each dataset, we can still get an accurate representation of the larger population we are sampling from.
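A minimal sketch of the sampling step (the seed value is an arbitrary choice for reproducibility):

```r
set.seed(1234)  # arbitrary seed so the sample is reproducible
sampleSize <- 100000

blogsSample   <- sample(blogs,   sampleSize)
newsSample    <- sample(news,    sampleSize)
twitterSample <- sample(twitter, sampleSize)
```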
Now that we have a sample of each dataset, we will analyze the relationship and frequency of word groupings by their n-grams. More specifically, we will analyze the top 10 most frequent unigrams (single words), bigrams (two-word pairs), and trigrams (three-word sequences). We will be using the tm and tokenizers packages to create the tokens and load them into data frames.
First, we will create sub-tables (that we will later feed to gt tables for visualization purposes) of frequent n-gram counts for each dataset. These sub-tables will then be column-bound/merged together into one aggregate table for each n-gram.
Let's now create an aggregate table named topUnigrams that lists the top 10 most frequent unigrams (words) in each dataset, along with their respective shares.
We can now create the exact same aggregate table for bigrams, named topBigrams.
Lastly, we will create the same table for trigrams, named topTrigrams.
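The sketch below shows one way these three aggregate tables could be assembled. The top_ngrams helper, the dataset label prefixes, and the use of dplyr/tibble alongside tokenizers are my own illustrative choices, not necessarily the report's exact code.

```r
library(tokenizers)
library(dplyr)

# Return the top 10 most frequent n-grams in a sample of lines, along with each
# n-gram's share of all n-grams in that sample; column names get a dataset prefix
top_ngrams <- function(lines, size = 1, top = 10, label = "data") {
  tokens <- if (size == 1) {
    unlist(tokenize_words(lines))
  } else {
    unlist(tokenize_ngrams(lines, n = size))
  }
  counts <- sort(table(tokens), decreasing = TRUE)[1:top]
  out <- tibble(
    ngram = names(counts),
    share = as.numeric(counts) / length(tokens)
  )
  setNames(out, paste(label, names(out), sep = "_"))
}

# Build one sub-table per dataset, then column-bind them into a single
# aggregate table for each n-gram size
topUnigrams <- bind_cols(
  top_ngrams(blogsSample,   1, label = "blogs"),
  top_ngrams(newsSample,    1, label = "news"),
  top_ngrams(twitterSample, 1, label = "twitter")
)
topBigrams <- bind_cols(
  top_ngrams(blogsSample,   2, label = "blogs"),
  top_ngrams(newsSample,    2, label = "news"),
  top_ngrams(twitterSample, 2, label = "twitter")
)
topTrigrams <- bind_cols(
  top_ngrams(blogsSample,   3, label = "blogs"),
  top_ngrams(newsSample,    3, label = "news"),
  top_ngrams(twitterSample, 3, label = "twitter")
)
```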
Exploratory Analysis
Now that we have our topUnigrams, topBigrams, and topTrigrams tables, we can pipe them into gt functions from the gt package, which is used for creating aesthetically pleasing tables.
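As a minimal illustration of this step for the unigram table (the header text and percent formatting are my assumptions about the presentation, not the report's exact styling):

```r
library(gt)

# Render the aggregate unigram table as a formatted gt table
topUnigrams |>
  gt() |>
  tab_header(title = "Top 10 Most Frequent Unigrams by Dataset") |>
  fmt_percent(columns = ends_with("share"), decimals = 1)
```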
As you can see, the most frequent word in every dataset is by far “the”, with a share ranging from ~3-6% of all words across the datasets. Similarly, the next most common words are almost all function words (articles, prepositions, and conjunctions). This is to be expected, as such function words account for a large share of the words in most English-language texts.
Again, we see that many of the top bigrams involve combinations of common function words, as observed when looking at the top unigrams. Interestingly, we can see a slight difference in some of the bigrams in the Twitter dataset, such as “thanks for” and “i love”, which to me reflects the more “personal” nature of social media platforms (i.e., the more frequent expression of opinions and emotions, in contrast to the impartial and formal tone of much news and blog writing).
Additionally, when observing the frequency of bigrams, one can immediately notice that the percent shares of the top ten bigrams are significantly smaller than those of the top ten unigrams. This makes sense and is to be expected given how tokenizers construct bigrams (and higher-order n-grams) in this context: an exact two-word (or longer) sequence is far more distinct and therefore rarer relative to the full set of n-grams than any single word is relative to the full set of words, because the counts are spread across a much larger space of possible combinations.
Once again, many of the top trigrams contain a mix of common function words. This time, we can see even more of the difference between the Twitter dataset and the blogs and news datasets. Here, we are able to see that the phrase “thanks for the follow” appears to be among the most common for Twitter (with the trigrams “thanks for the”, “thank you for”, and “for the follow” occupying three of the top ten most frequent trigrams).
Again, we also see that this time the percent share of each top trigram is essentially 0% – with the exception of “thanks for the”, which makes up 0.1% of all trigrams in the Twitter dataset. This again relates back to the point mentioned earlier about the bigrams.