NLP (Natural Language Processing) Predictive Text App

Date
January 10, 2024
Tags
R, App

Motive/Background

This NLP (Natural Language Processing) text prediction app was built as part of the final capstone project for the ten-course Data Science Specialization offered by Johns Hopkins University. For the final project, students were challenged to delve into an unfamiliar area of data science that had not been taught in any of the prior courses. As stated in the instructions for the assignment on Coursera:

"You will use all of the skills you have learned during the Data Science Specialization in this course, but you'll notice that we are tackling a brand new application: analysis of text data and natural language processing. This choice is on purpose. As a practicing data scientist you will be frequently confronted with new data types and problems. A big part of the fun and challenge of being a data scientist is figuring out how to work with these new data types to build data products people love."

Overall, I had a great (and challenging) time diving into the world of NLP to learn its nuances and the main challenges and steps involved in creating a working text prediction algorithm. All in all, it took me around three weeks of on-and-off work to complete this project, starting from zero baseline knowledge of NLP.

App

Click the link below to access and try out the app yourself!

Additionally, there is an entire “App Info” section within the app that goes into more detail as to how the algorithm works.

NOTE: The app has been tested and confirmed working by multiple people at the time of writing this, but if for some reason the app does not work or output predictions properly, please let me know by emailing me at: corea.kenen@gmail.com

Preview

Below is a screenshot preview of the app for convenience:

image

Code

Click here to view the GitHub repository for the project. Descriptions of each file are listed in the linked “README” file.

Milestone Report: Exploratory Analysis of the Text Data

About halfway through the project, students were also instructed to create a report analyzing some of the key features of the text data that was provided.

Summary

The following is a milestone report summarizing the exploratory analysis and the steps taken thus far to prepare for building an NLP text prediction model/app. As part of the capstone project for the Data Science Specialization offered by Johns Hopkins University, students are instructed to use web-scraped data provided by SwiftKey to build a Shiny app that predicts the next word of a sentence given a few words of input (similar to the modern-day predictive keyboard features found on iPhones, for example).

Loading/Reading in the Data

The first step of our analysis begins with retrieving the given datasets and loading them into our R environment. As stated previously, the datasets are provided courtesy of SwiftKey, who scraped web data from three sources: blog sites, news sites, and Twitter. The data from each source is contained in its own corresponding .txt file. Furthermore, we will only be working with the “en_US” locale datasets for this project. The data is provided to students via a download link; to preserve reproducibility in our analysis, we will download the dataset using this link directly within our R script.

Code
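As a rough sketch of that step (the URL shown is the standard Coursera-SwiftKey download link, and the file names/paths are my own choices, so adjust as needed):

# Download and unzip the SwiftKey dataset if it is not already present
dataUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(dataUrl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("final")) {
  unzip("Coursera-SwiftKey.zip")  # extracts the "final/en_US" folder among others
}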

From there, we will assign each dataset to its own corresponding object/character vector using readLines().

Code
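A minimal sketch of that step (the object names are my own, and the file paths assume the folder structure extracted above):

# Read each en_US source file into its own character vector, one element per line
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)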

Let's now run some quick summary statistics on each dataset.
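The report does not show the exact code behind these per-dataset statistics; a simple sketch using the stringi package (an assumption on my part, as is the helper name summarizeDataset) could look like this, with the outputs shown in the screenshots below:

library(stringi)

# Total lines, total words, and average words per line for a character vector
summarizeDataset <- function(x) {
  wordsPerLine <- stri_count_words(x)
  c(totalLines   = length(x),
    totalWords   = sum(wordsPerLine),
    avgWordsLine = round(mean(wordsPerLine), 1))
}

summarizeDataset(blogs)
summarizeDataset(news)
summarizeDataset(twitter)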

Blogs Dataset

image

As you can see, the blogs dataset contains 899,288 total lines, 37,546,250 total words, and an average of ~41.8 words per line.

News Dataset

image

The news dataset contains 1,010,242 total lines, 34,762,395 total words, and an average of ~34.4 words per line.

Twitter Dataset

image

Lastly, the Twitter dataset contains 2,360,148 total lines, 30,093,372 total words, and an average of ~12.8 words per line. Notice the discrepancy in the average words per line in the Twitter dataset compared to the blogs and news datasets; this is largely due to the 140-character limit imposed on tweets (at the time the data was collected).

Data Processing

Since each dataset contains up to millions of lines, we will randomly sample 100,000 lines from each dataset for our exploratory analysis. The main reason for this is simply to reduce the loading/processing times of some of the tokenizing functions we will be using later (especially when creating bigram and trigram tokens, which increase the dataset sizes substantially). By randomly sampling 100,000 lines from each dataset, we can still get an accurate representation of the larger population we are sampling from.

Code
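A minimal sketch of the sampling step (the seed value is arbitrary and the object names are my own, chosen only to keep the example reproducible):

set.seed(1234)  # arbitrary seed so the same sample is drawn on every run

sampleSize    <- 100000
blogsSample   <- sample(blogs,   sampleSize)
newsSample    <- sample(news,    sampleSize)
twitterSample <- sample(twitter, sampleSize)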

Now that we have a sample of each dataset, we will analyze word frequencies by n-gram. More specifically, we will analyze the top 10 most frequent unigrams (single words), bigrams (two-word pairs), and trigrams (three-word sequences). We will be using the tm and tokenizers packages to create the tokens and load them into dataframes.

First, we will create sub-tables (which we will later feed into gt tables for visualization purposes) of frequent n-gram counts for each dataset. These sub-tables will then be column-bound/merged together into one aggregate table for each n-gram.

Let's now create an aggregate table named topUnigrams that lists the top 10 most frequent unigrams (words) in each dataset, along with their respective shares.

Code
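The exact code lives in the repository; as a rough sketch of the idea using only the tokenizers package (the helper topNgrams and its column names are my own inventions, and the tm steps mentioned above are omitted here):

library(tokenizers)

# Top 10 most frequent n-grams in a character vector, with their percent share
topNgrams <- function(x, n = 1, top = 10) {
  if (n == 1) {
    tokens <- unlist(tokenize_words(x, lowercase = TRUE))
  } else {
    tokens <- unlist(tokenize_ngrams(x, n = n, lowercase = TRUE))
  }
  freq <- sort(table(tokens), decreasing = TRUE)
  data.frame(ngram = names(freq)[1:top],
             share = round(100 * as.numeric(freq[1:top]) / sum(freq), 2))
}

# One sub-table per dataset, column-bound into a single aggregate table
topUnigrams <- cbind(Blogs   = topNgrams(blogsSample,   n = 1),
                     News    = topNgrams(newsSample,    n = 1),
                     Twitter = topNgrams(twitterSample, n = 1))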

We can now create the exact same aggregate table for bigrams, named topBigrams.

Code
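With the hypothetical topNgrams() helper sketched above, this is just a change of the n argument:

# Same aggregation as before, but with two-word pairs (n = 2)
topBigrams <- cbind(Blogs   = topNgrams(blogsSample,   n = 2),
                    News    = topNgrams(newsSample,    n = 2),
                    Twitter = topNgrams(twitterSample, n = 2))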

Lastly, we will create the same table for trigrams, named topTrigrams.

Code
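And once more with n = 3, again using the hypothetical helper from above:

# Three-word sequences (n = 3)
topTrigrams <- cbind(Blogs   = topNgrams(blogsSample,   n = 3),
                     News    = topNgrams(newsSample,    n = 3),
                     Twitter = topNgrams(twitterSample, n = 3))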

Exploratory Analysis

Now that we have our topUnigrams, topBigrams, and topTrigrams tables, we can pipe them into gt functions from the gt package, which is used for creating aesthetically pleasing tables.

Code
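As a minimal sketch of the gt piping (the title and spanner labels are placeholders of my own; the same pattern applies to topBigrams and topTrigrams further below):

library(gt)

# Render the aggregate unigram table with one column group (spanner) per dataset
topUnigrams |>
  gt() |>
  tab_header(title = "Top 10 Most Frequent Unigrams per Dataset") |>
  tab_spanner(label = "Blogs",   columns = starts_with("Blogs")) |>
  tab_spanner(label = "News",    columns = starts_with("News")) |>
  tab_spanner(label = "Twitter", columns = starts_with("Twitter"))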
image

As you can see, the most frequent word in every dataset is by far “the”, with a share ranging from ~3-6% of all words across the datasets. Similarly, the next most common words are mostly articles, prepositions, and other common function words. This is to be expected, as function words like these account for a large share of the words in most English-language text.

Code
image

Again, we see that many of the top bigrams involve combinations of common articles and other function words, as observed when looking at the top unigrams. Interestingly, there is a slight difference in some of the Twitter bigrams, such as “thanks for” and “i love”, which to me reflects the more “personal” component of social media platforms (i.e., the more frequent expression of opinions and emotions, in contrast to the impartial and formal tone of much news and blog writing).

Additionally, when observing the frequency of bigrams, one immediately notices that the percent shares of the top ten bigrams are significantly smaller than those of the top ten unigrams. This makes sense and is to be expected given how tokenizers build bigrams (and higher n-grams) in this context: an exact two-word (or longer) sequence is far more distinct, and therefore less common relative to the entire dataset, because we are counting occurrences of that exact sequence rather than each word on its own (which inherently “covers” more of the dataset).

Code
image

Once again, many of the top trigrams contain a mix of common articles and other function words. This time, the difference between the Twitter dataset and the blogs and news datasets is even more pronounced. Here, we are able to see that the phrase “thanks for the follow” appears to be among the most common for Twitter (with the trigrams “thanks for the”, “thank you for”, and “for the follow” occupying three of the top ten most frequent trigrams).

Again, we also see that this time the percent share of each top trigram is essentially 0%, with the exception of “thanks for the”, which makes up 0.1% of all trigrams in the Twitter dataset. This again relates back to the point made earlier about bigrams.