Sentiment Analysis of Tweets
College:
The Dorothy and George Hennings College of Science, Mathematics, and Technology
Major:
Computer Science
Faculty Research Advisor(s):
Ching-yu Huang
Abstract:
This project will use machine learning to capture emotion values from three large datasets of tweets. The first dataset consists of hate speech/cyberbullying tweets, the second is a more general collection of tweets, and the third is related to the COVID-19 hashtag. The hate speech dataset has two attributes, an ID and the tweet. The general dataset has an ID, the tweet, and a label. The COVID-19 dataset has 10 attributes, including the tweet and a unique username, but attributes unrelated to the sentiment of the tweet will likely not be used. The goal of the project is to predict the sentiment of tweets correctly with a high degree of accuracy, with a target of around 80% provided processing times remain practical. The data mining process to be used is sentiment analysis, which quantifies emotions such as rage or joy as negative (1) or positive (0) values. Machine learning techniques from deep learning and natural language processing will be used, with a pretrained model such as BERT. A tokenizer will split each tweet into “tokens”, the word or subword units of a phrase or sentence that the model takes as input to determine the sentiment of the whole phrase or sentence. Python will be used for the machine learning work, and the Kean University Obi2 database will be used for data storage and retrieval through MySQL.
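As a rough illustration of the label encoding and the accuracy target described above, the following Python sketch may help. It is only a sketch under stated assumptions: `simple_tokenize` is a hypothetical regex-based stand-in for the pretrained tokenizer (it is not the BERT tokenizer), `accuracy` is a hypothetical helper, and the sample predictions are made up for demonstration rather than taken from the project's data.

```python
import re

# Label encoding as stated in the abstract: negative = 1, positive = 0.
NEGATIVE, POSITIVE = 1, 0


def simple_tokenize(tweet: str) -> list[str]:
    """Hypothetical stand-in for a pretrained tokenizer.

    Lowercases the tweet and keeps words, hashtags, and mentions.
    A real BERT tokenizer would produce subword units instead.
    """
    return re.findall(r"[#@]?\w+", tweet.lower())


def accuracy(predicted: list[int], actual: list[int]) -> float:
    """Fraction of predicted labels that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


if __name__ == "__main__":
    print(simple_tokenize("I love #Python!"))

    # Made-up labels purely for demonstration.
    truth = [NEGATIVE, POSITIVE, POSITIVE, NEGATIVE, POSITIVE]
    guess = [NEGATIVE, POSITIVE, NEGATIVE, NEGATIVE, POSITIVE]
    print(f"accuracy = {accuracy(guess, truth):.0%}")
```

In this sketch, accuracy is simply the share of tweets whose predicted label matches the dataset label, which is the quantity the 80% target would be measured against.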