Machine Learning, NLP, BERT, Keras, Visualization, D3.js, Python, Twitter, Django, SQL, AWS
August 5, 2019
In this project, I...
- Harvested ~800k tweets/day from Twitter's streaming API with Python.
- Trained and implemented NLP models with BERT/Keras to identify both anger and hate speech in tweets.
- Devised word-counting techniques to detect new topics and annotations.
- Constructed a pipeline on AWS to run every tweet through all models, then label the results and store them in a MySQL database.
- Displayed the encoded results in a visualization web app built with D3.js and Django.
This was the final Capstone project of my Master's degree in Information and Data Science. I recruited three other data scientists, pitched them on the idea, led them through 14 rigorous weeks of work, and motivated them to keep pushing through doubts and compressed schedules.
We were invited to present the project at the Capstone Showcase, alongside the seven other top projects (from a field of 24).
We built and trained three NLP models using Keras on top of BERT to identify anger and hate speech in tweets. The first is a binary hate model trained on 10M rows of distantly labeled data. It is followed by a directly supervised multiclass model that determines the target of the hate speech. The third is a binary anger model trained on 13k hand-labeled tweets.
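The way the three models fit together can be sketched roughly as below. This is a minimal illustration rather than the project's actual code: the three predictor callables are hypothetical stand-ins for the trained BERT/Keras models, the `TARGETS` class names and the 0.5 threshold are assumptions made for the example, and the sketch assumes the target model only runs on tweets the hate model has flagged.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical target classes; the project's real label set is not given here.
TARGETS = ("group_a", "group_b", "other")


@dataclass
class TweetLabels:
    anger: bool
    hate: bool
    hate_target: Optional[str]  # only filled in when the hate model fires


def label_tweet(
    text: str,
    anger_prob: Callable[[str], float],
    hate_prob: Callable[[str], float],
    target_probs: Callable[[str], Sequence[float]],
    threshold: float = 0.5,  # assumed decision cutoff
) -> TweetLabels:
    """Run one tweet through the anger model and the hate -> target cascade."""
    is_angry = anger_prob(text) >= threshold
    is_hate = hate_prob(text) >= threshold

    target = None
    if is_hate:
        # The multiclass model is consulted only for tweets flagged as hate.
        probs = target_probs(text)
        best = max(range(len(TARGETS)), key=lambda i: probs[i])
        target = TARGETS[best]

    return TweetLabels(anger=is_angry, hate=is_hate, hate_target=target)
```

In a real pipeline each callable would wrap a model's prediction call on a batch of tweets; taking them as parameters keeps the cascade logic testable without loading any weights.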
Every day, we collect nearly a million tweets as a random sample from Twitter and store them in an S3 bucket on AWS. These tweets are condensed into daily parquet files, which are fed through the models every night at midnight. Once every tweet has been labeled as anger, hate, or neither, the tweets pass through a topic detection model that labels what the users are angry about. An annotation model then creates a query for Google's news API to pull the most relevant headline, if one exists, to give context for why users are angry.
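The topic and annotation steps are not specified in detail here, so the following is only a generic word-counting sketch of the idea, not the project's method. It ranks words by how over-represented they are in today's angry tweets compared with a baseline sample, then joins the top terms into a search string. The function names, the stopword list, the tokenizer, and the smoothing constant are all assumptions for illustration, and no real news API is called.

```python
import re
from collections import Counter
from typing import Iterable, List

TOKEN = re.compile(r"[a-z']+")
# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on",
     "for", "this", "that", "so", "about", "i", "it"}
)


def tokenize(text: str) -> List[str]:
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    return [w for w in TOKEN.findall(text.lower()) if w not in STOPWORDS]


def trending_terms(
    today: Iterable[str],
    baseline: Iterable[str],
    top_n: int = 3,
    smoothing: float = 1.0,  # assumed additive smoothing constant
) -> List[str]:
    """Return the words most over-represented today relative to the baseline."""
    today_counts = Counter(w for t in today for w in tokenize(t))
    base_counts = Counter(w for t in baseline for w in tokenize(t))
    today_total = sum(today_counts.values()) or 1
    base_total = sum(base_counts.values()) or 1

    def lift(word: str) -> float:
        today_rate = today_counts[word] / today_total
        # Crude smoothing so words unseen in the baseline do not divide by zero.
        base_rate = (base_counts[word] + smoothing) / (base_total + smoothing)
        return today_rate / base_rate

    # Highest lift first; ties broken alphabetically for determinism.
    ranked = sorted(today_counts, key=lambda w: (-lift(w), w))
    return ranked[:top_n]


def news_query(terms: Iterable[str]) -> str:
    """Join the top terms into a plain search string for a news lookup."""
    return " ".join(terms)
```

For example, given today's tweets about a cancelled flight and a baseline of unrelated complaints, `trending_terms(today, baseline, top_n=2)` would surface terms such as "flight" and "cancelled", and `news_query` would turn them into the string "flight cancelled".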
The resulting data is stored in a MySQL table on Amazon RDS. That table powers the visualization on the website, which is built with D3.js and served by a Django app. For more information, you can visit the "About this Project" page on the Outrage|Us website.