Sentiment Analysis of Twitter Data
INTRODUCTION
The COVID-19 pandemic has already affected thousands of lives in myriad ways. Educational institutions are suffering too: they cannot provide adequate internship opportunities and exposure to their students, because organizations cannot afford to take on more interns or staff. Yet some institutions still try to make sure that we get the required experience regardless of the circumstances. I would like to thank UST Software for giving me the opportunity to work on a sentiment analysis project, which aligns well with my interests in machine learning and natural language processing.
Sentiment analysis is the classification of the sentiment expressed in a piece of text (positive, neutral or negative) using techniques from machine learning and natural language processing. Our objective was to collect Twitter data and build a sentiment model on it, so that we could extract the sentiment around any topic (for example, COVID-19) and visualize it for a better understanding of the overall picture.
Sentiments of different countries were summarized and a dashboard was built to present the information in visual format.
An interface to collect Twitter data was also implemented which stored the data in MongoDB.
APPROACHES
1. The first approach used Tweepy to collect data, MySQL for storage, TextBlob and NLTK for preprocessing and sentiment analysis, and Plotly for the visualization and dashboard.
2. This approach was later discarded in favour of a better one built on the VADER (Valence Aware Dictionary and sEntiment Reasoner) model, which maps lexical features to emotion intensities known as sentiment scores. A sentiment score is computed for each location, and the Dash and Matplotlib modules are then used for data visualization.
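To make the idea of a sentiment score concrete, here is a minimal sketch of turning VADER's `compound` score into one of the three labels. The ±0.05 cut-off is the threshold conventionally suggested for VADER; the function names are hypothetical, and this is not necessarily how the project code was written.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def classify_texts(texts, analyzer):
    """Label each text using any object with a VADER-style polarity_scores().

    `analyzer` is expected to return a dict containing a "compound" key,
    as VADER's SentimentIntensityAnalyzer does.
    """
    return [
        label_from_compound(analyzer.polarity_scores(text)["compound"])
        for text in texts
    ]
```

With the `vaderSentiment` package installed, the analyzer would be `SentimentIntensityAnalyzer()` imported from `vaderSentiment.vaderSentiment`. NLTK ships an equivalent class, and which of the two the project used is not stated here.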
DATA VISUALIZATION
I worked primarily on this part of the project, so I will go into detail about how we made it work. First, we obtain a database of tweets related to a set of topics. A dropdown is then created with the topics as its options. In Dash, a callback decorator lets us specify that a change in a particular input field should update some property of an output element. By default, the coronavirus_covid topic is selected. One bar is drawn for each country, and each bar is divided into three parts, one each for positive, negative and neutral sentiment. The x-axis shows the countries and the y-axis shows the number of tweets with negative, positive and neutral sentiment.
Since the plot is interactive, you can also hide the positive, negative or neutral series by clicking on its label.
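The dropdown-plus-callback setup described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the component ids, the function names and the shape of `data_by_topic` (topic, then country, then a dict of sentiment counts) are all made up for the example.

```python
SENTIMENTS = ("positive", "neutral", "negative")


def build_figure(counts_by_country, topic):
    """Build a stacked bar figure (a plain dict) with one bar per country."""
    countries = sorted(counts_by_country)
    traces = [
        {
            "type": "bar",
            "name": sentiment,
            "x": countries,
            # Missing sentiments count as zero tweets.
            "y": [counts_by_country[c].get(sentiment, 0) for c in countries],
        }
        for sentiment in SENTIMENTS
    ]
    layout = {
        "barmode": "stack",  # the three segments sit on top of each other
        "title": {"text": f"Tweet sentiment by country: {topic}"},
        "xaxis": {"title": {"text": "Country"}},
        "yaxis": {"title": {"text": "Number of tweets"}},
    }
    return {"data": traces, "layout": layout}


def create_app(data_by_topic, default_topic="coronavirus_covid"):
    """Wire a topic dropdown to the graph (assumes Dash 2.x is installed)."""
    # Imported here so build_figure stays usable without Dash installed.
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Dropdown(
            id="topic-dropdown",  # hypothetical component id
            options=[{"label": t, "value": t} for t in data_by_topic],
            value=default_topic,
            clearable=False,
        ),
        dcc.Graph(id="sentiment-graph"),  # hypothetical component id
    ])

    # Re-draw the figure whenever the dropdown value changes.
    @app.callback(Output("sentiment-graph", "figure"),
                  Input("topic-dropdown", "value"))
    def update_graph(topic):
        return build_figure(data_by_topic[topic], topic)

    return app
```

Calling `create_app(data).run(debug=True)` starts the dashboard on recent Dash releases (older ones use `run_server`). Keeping the figure construction in a separate function means it can be checked without starting a server.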
METHODOLOGY
- First, the data is collected and stored in MongoDB. The location attached to the tweets themselves is not sufficient, since most tweets carry no location information. As an alternative, we use the location of the user rather than of the tweet.
- Filtering the data is of paramount importance, because the location data obtained this way is neither complete nor clean. Some entries contain other information instead of a real, valid location. A module called pycountry was used to filter out the tweets with an invalid location. We also had to make sure that abbreviations and full country names are not counted as separate entities. A collection was then built for each country after preprocessing with NLTK.
- The analysis is then done using VADER, which is fast even with large collections of data.
- Finally, the results are visualized using Dash and Matplotlib. Graphs are plotted as described above, and a globe-style map is also generated.
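The filtering and grouping steps above can be sketched as below. The field names `user_location` and `sentiment` and both function names are assumptions for illustration. The country lookup is passed in as a parameter, so that `pycountry.countries.lookup` (which raises `LookupError` for unknown values) can be plugged in without the sketch depending on it.

```python
from collections import Counter, defaultdict


def normalize_country(raw_location, lookup):
    """Return a canonical country name for a free-text location, or None.

    `lookup` is any callable that returns an object with a `.name`
    attribute or raises LookupError, e.g. pycountry.countries.lookup.
    """
    if not raw_location:
        return None
    parts = [p.strip() for p in raw_location.split(",")]
    # Try the last comma-separated part first: "Mumbai, India" -> "India".
    for part in reversed(parts):
        if not part:
            continue
        try:
            return lookup(part).name
        except LookupError:
            continue
    return None


def tally_by_country(tweets, lookup):
    """Count sentiment labels per country, dropping tweets with no valid location."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        country = normalize_country(tweet.get("user_location"), lookup)
        if country is not None:
            counts[country][tweet["sentiment"]] += 1
    return {country: dict(c) for country, c in counts.items()}
```

Because abbreviations and full names resolve to the same canonical name, "IN" and "India" end up in one bucket. Two-letter codes can be ambiguous in user-entered text (for example "CA"), so a real pipeline would likely need extra rules on top of this.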
CONCLUSION
It was a fun learning experience. Thanks to extensive research, we got to learn about various models and techniques. We were able to analyze those techniques, settle on the most feasible solution given our experience and knowledge, and achieve the desired results. We acquainted ourselves with entirely new modules such as Dash and learnt a good deal about plotting and visualization with Python. A frontend for fetching tweets was also built, which exposed us to CSS and Bootstrap (with which we didn’t have much experience). We encountered many bugs and spent a lot of time fixing them, as is commonplace in a developer’s routine.
HAPPY CODING!