
METHODOLOGY


Photo: Jonathan Moore/Getty Images


I plan to use tweets to analyze the career and legacy of Kobe Bryant. This is largely qualitative research, since I will be working with people’s comments and feelings rather than numbers. Twitter is a popular social media platform where users share their thoughts and feelings in posts of up to 280 characters.

To access and scrape tweets from Twitter, I had to set up a developer account and get the appropriate API credentials. In the process, I learned that the standard search API only returns tweets from roughly the past seven days.

In my earlier research on sentiment analysis, I found a few successful methods that I plan to mimic in my project. Yang Yu and Xiao Wang conducted a sentiment analysis of the 2014 World Cup using tweets from U.S. sports fans. In their project, they scraped tweets from Twitter while the U.S. National Team was playing and then broke the text up into single important words to analyze.

Data Collection

Using a combination of the Earth Data Science and Marco Bonzanini Twitter-scraping tutorials, along with help from my professors Dr. Curt Rode and Dr. Sean Crotty, I wrote a Python script that returns a JSON file full of tweets containing the requested keyword or hashtag. A second Python script then converts the JSON file into a CSV file containing my tweets.
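The exact scripts are on my GitHub page, but a minimal sketch of the two steps might look like the following. It assumes the tweepy library and uses placeholder credentials, keywords, filenames and column choices rather than the project’s actual ones.

```python
# A rough sketch of the two scripts, assuming tweepy and placeholder
# credentials, keywords and filenames rather than the exact project code.
import csv
import json

import tweepy

# Credentials from the Twitter developer account (placeholders).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Standard search only reaches back about seven days, so scrape promptly.
# (tweepy 3.x; in newer versions this endpoint is api.search_tweets.)
tweets = [status._json for status in
          tweepy.Cursor(api.search, q="#RIPKobe", lang="en",
                        tweet_mode="extended").items(1000)]

with open("ripkobe_tweets.json", "w", encoding="utf-8") as f:
    json.dump(tweets, f)

# Second script: flatten the JSON file into a CSV, one row per tweet.
with open("ripkobe_tweets.json", encoding="utf-8") as f:
    tweets = json.load(f)

with open("ripkobe_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["created_at", "user", "text"])
    for t in tweets:
        writer.writerow([t["created_at"],
                         t["user"]["screen_name"],
                         t.get("full_text", t.get("text", ""))])
```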



To access tweets that are specifically talking about Bryant, I used keywords and hashtags such as ‘RIPKobe’, ‘Kobe’ and ‘GirlDad.’ For each phrase, I scraped at least 1,000 tweets. Since I was unable to find a sufficient number of negative tweets in late March and April, I also performed an advanced Google search for articles that specifically mentioned Bryant’s sexual assault case.

Data Cleaning and Processing

In all three of the sentiment analysis projects that I researched, removing stopwords was an essential step. Stopwords are commonly used words such as “the”, “a”, “an” and “in” that carry little meaning on their own, and they can quickly be removed from my Twitter data. Another essential step is cleaning up the text itself. After a few preliminary data scrapes, I noticed URLs, usernames and other clutter appearing in my main text. To properly analyze my data, I needed to get it into a clean, usable format.

In my research on Twitter data, I came across a Kaggle user who published a notebook on efficient tweet preprocessing with the tweet-preprocessor package. This library has functions that can easily clean tweets containing URLs, emojis and other unwanted characters. Using a few of these functions, I was able to clean my original data and create a new, usable CSV file.
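A minimal sketch of that cleaning pass is below. The file and column names carry over from the hypothetical scraping example above, and the option flags shown are just one reasonable configuration of tweet-preprocessor, not necessarily the exact one I used.

```python
# Sketch of the cleaning pass with tweet-preprocessor (imported as
# "preprocessor"); filenames and column names are placeholders.
import pandas as pd
import preprocessor as p  # the tweet-preprocessor package

# Remove URLs, @mentions and emojis; leaving hashtags in keeps words
# like #GirlDad available for the later word counts.
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.EMOJI)

df = pd.read_csv("ripkobe_tweets.csv")
df["clean_text"] = df["text"].astype(str).apply(p.clean)
df.to_csv("ripkobe_tweets_clean.csv", index=False)
```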

Next, I installed and used the gensim library. This Python package, built for topic modeling and other natural language processing work, contains a helpful function for removing simple stopwords.
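A sketch of that step, again using the hypothetical file and column names from above:

```python
# Sketch of the stopword pass using gensim's built-in stopword list.
import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords

df = pd.read_csv("ripkobe_tweets_clean.csv")
# Lowercase first, since gensim's stopword list is lowercase.
df["clean_text"] = (df["clean_text"].astype(str)
                    .str.lower()
                    .apply(remove_stopwords))
df.to_csv("ripkobe_tweets_processed.csv", index=False)
```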



Creating WordClouds

In a 2017 Twitter study on consumer insights for Jet Airways, Vandana Ahuja and Moonis Shakeel used WordCloud visualizations to display their findings. I felt that this was an appropriate and interesting way to display my conclusion because the more often a word is used, the larger it appears in the visualization. This will allow me to effectively show the words that were commonly used in tweets about Kobe Bryant and help me understand what people feel his legacy is.

I was able to find some sources with step-by-step walkthroughs on how to take these individual words and create a visualization in which each word’s size reflects how frequently it is used.

Using the DataCamp tutorial from Duong Vu and the wordcloud Python package, I was able to take my Twitter data, tokenize it, remove custom stopwords and generate WordClouds.



The wordcloud library has a function that takes my Twitter data and tokenizes it. Tokenization is the process of taking a larger body of text, for example a tweet, and splitting it into smaller phrases or even single words. This gave me a large word bank of the words used to talk about or describe Kobe Bryant.
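As a small illustration of what tokenization produces, the wordcloud package’s process_text() helper splits and counts the words in a piece of text; the tweet below is invented.

```python
# Tokenizing a single made-up tweet with wordcloud's process_text().
from wordcloud import WordCloud

sample = "Kobe Bryant inspired a generation Mamba Mentality forever RIPKobe"
print(WordCloud().process_text(sample))
# e.g. {'Kobe': 1, 'Bryant': 1, 'inspired': 1, 'generation': 1, ...}
```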

From this point, I generated a WordCloud from the now-tokenized data, which revealed the words that were commonly used in tweets about Bryant. I then went back and added custom stopwords to remove swear words and phrases that appeared as spam.
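A sketch of the generation step is below; the extra stopwords listed are placeholders for the spam and swear terms I actually removed, and the file names again follow the hypothetical ones above.

```python
# Sketch of WordCloud generation with custom stopwords added.
import pandas as pd
from wordcloud import STOPWORDS, WordCloud

df = pd.read_csv("ripkobe_tweets_processed.csv")
text = " ".join(df["clean_text"].astype(str))

extra_stops = {"rt", "amp", "https"}  # placeholders for spam/clutter terms
stopwords = STOPWORDS.union(extra_stops)

wc = WordCloud(stopwords=stopwords, background_color="white",
               width=1200, height=800).generate(text)
wc.to_file("kobe_wordcloud.png")
```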

To get my WordClouds to appear in the shape of Kobe Bryant, I used Photoshop to create a mask that would be filled with the most common words used in the tweets.
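The masked version works the same way, except the silhouette exported from Photoshop is read in as a NumPy array and passed to WordCloud’s mask parameter; the mask filename here is a placeholder.

```python
# Sketch of the masked WordCloud: words only fill the non-white pixels
# of the silhouette image exported from Photoshop.
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import STOPWORDS, WordCloud

df = pd.read_csv("ripkobe_tweets_processed.csv")
text = " ".join(df["clean_text"].astype(str))

mask = np.array(Image.open("kobe_mask.png"))  # placeholder mask file

wc = WordCloud(mask=mask, background_color="white",
               stopwords=STOPWORDS).generate(text)
wc.to_file("kobe_wordcloud_masked.png")
```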

All of the code and sample data can be found on my GitHub page.