Extracting Twitter data using Python
Twitter is one of the most widely used social networks: 330 million people were using it monthly in Q1 2019. It may not have the explosive growth it had a couple of years ago, but it remains a place where businesses and public figures engage with their audiences, important messages are shared in near-real time and, as recent events show, even politics is done.
Businesses and other actors have long understood the power of Twitter data. Trading firms are a typical example. Given the fast-paced, concise format of its messages, Twitter is a useful source of signals that could move the stock market. A CEO commending a product or service? A partnership could be in the making. Multiple complaints from users about build-quality issues in a new device? Bad news for the hardware vendor. So Twitter data does have a business use case.
In this three-part series, we’ll look into a Proof of Concept for extracting, transforming and understanding Twitter data relevant to a specific topic.
If you wish to follow along, please refer to the GitHub repository containing the full Jupyter notebook for this series.
We’re going to use Python 3.7 and a few specialized libraries to get this done.
But first, we’ll need to obtain the necessary credentials from Twitter by registering an application on its developer site. After filling in the appropriate information, we should have four credentials: the API key, API secret key, Access token and Access token secret.
Once we have them, we can proceed to extract data from Twitter. We’re going to use the Twitter Search API to get our data.
Before we proceed, there’s one more thing to bear in mind. The Search API is offered in multiple tiers, including the free Standard tier we’re going to use. Its limitations are, at the time of writing, described as follows:
This search API searches against a sampling of recent Tweets published in the past 7 days. Part of the ‘public’ set of APIs.
Nevertheless, this should be enough for our purposes.
Getting Twitter data
Moving forward, we could either make the API calls directly or use a library that does so for us. The Python ecosystem offers plenty of choice; we’re going to use TwitterSearch, which should help us get moving quicker.
TwitterSearch can be installed with pip, after which we’re ready to go:
pip install TwitterSearch
Let’s import the necessary objects and instantiate the TwitterSearch object using the credentials Twitter has set up for us. In this case, I’ve set up a file that will store them.
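The credentials file itself isn’t reproduced here, so what follows is only a minimal sketch of how this step could look. It assumes a JSON file (hypothetically named credentials.json) whose four keys match the keyword arguments of the TwitterSearch constructor, where the API key and API secret key correspond to consumer_key and consumer_secret. The helper names missing_keys, load_credentials and make_client are ours for illustration and are not part of the library.

```python
import json

# Assumed key names for the credentials file; they mirror the keyword
# arguments of the TwitterSearch constructor.
REQUIRED_KEYS = (
    "consumer_key",         # API key
    "consumer_secret",      # API secret key
    "access_token",         # Access token
    "access_token_secret",  # Access token secret
)


def missing_keys(creds):
    """Return the required keys that are absent or empty in `creds`."""
    return [key for key in REQUIRED_KEYS if not creds.get(key)]


def load_credentials(path):
    """Read the four credentials from a JSON file, failing early if any is missing."""
    with open(path, encoding="utf-8") as handle:
        creds = json.load(handle)
    missing = missing_keys(creds)
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    # Keep only the keys the constructor expects.
    return {key: creds[key] for key in REQUIRED_KEYS}


def make_client(creds):
    """Instantiate the TwitterSearch object from a credentials dictionary."""
    # Imported here so the helpers above work even without the package installed.
    from TwitterSearch import TwitterSearch  # third-party: pip install TwitterSearch
    return TwitterSearch(**creds)
```

With such a file in place, `ts = make_client(load_credentials("credentials.json"))` would give us the search object. Keeping the credentials in a separate file also makes it easier to leave them out of version control.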
We’ll now need to create a Search Order against the Search object we’ve defined above.
In this case, we’ve decided to analyze the messages concerning the UK’s upcoming exit from the European Union, colloquially known as Brexit. We’re going to analyze tweets in English.
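As a rough sketch of the search step, the order for English-language Brexit tweets could be built and consumed as shown here. It relies on the TwitterSearchOrder setters and the search_tweets_iterable method that the TwitterSearch library documents; the function names build_search_order and collect_tweets, and the limit parameter, are hypothetical names of our own.

```python
from itertools import islice


def build_search_order(keywords, language="en"):
    """Create a search order for the given keywords and tweet language."""
    # Imported here so collect_tweets can be exercised without the package installed.
    from TwitterSearch import TwitterSearchOrder  # third-party: pip install TwitterSearch
    order = TwitterSearchOrder()
    order.set_keywords(keywords)      # list of search terms, e.g. ["Brexit"]
    order.set_language(language)      # ISO 639-1 code, "en" for English
    order.set_include_entities(True)  # keep hashtags, URLs and mentions in the results
    return order


def collect_tweets(client, order, limit=None):
    """Pull tweets (plain dictionaries) into a list, stopping after `limit` if given."""
    return list(islice(client.search_tweets_iterable(order), limit))
```

Assuming `ts` is the TwitterSearch object created earlier, `tweets = collect_tweets(ts, build_search_order(["Brexit"]))` would return the list of tweets. Passing a small `limit` is a convenient way to try things out without exhausting the rate limit of the Standard tier.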
Now we have a list of tweets, which look like the one below.
We can see that we have a nested dictionary-like structure containing other dictionaries and lists. This needs to be flattened out so we can analyze the data more efficiently. The pandas library will facilitate this.
Here we’ve imported pandas and used its json_normalize function to transform our list of results into a pandas DataFrame, a two-dimensional tabular data structure.
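To make the flattening step concrete, here is a small sketch that runs json_normalize on two made-up, heavily abbreviated tweets. The sample values are invented for illustration; only the field names (id, id_str, created_at, text, user, entities) follow the shape of a real tweet. It assumes pandas 1.0 or later, where the function is available as pd.json_normalize (older releases expose it as pandas.io.json.json_normalize).

```python
import pandas as pd

# Invented sample data: real tweets carry far more fields than this.
sample_tweets = [
    {
        "id": 101,
        "id_str": "101",
        "created_at": "Mon Sep 02 10:00:00 +0000 2019",
        "text": "First example tweet about Brexit",
        "user": {"screen_name": "example_user_a", "followers_count": 10},
        "entities": {"hashtags": [{"text": "Brexit"}]},
    },
    {
        "id": 102,
        "id_str": "102",
        "created_at": "Mon Sep 02 10:05:00 +0000 2019",
        "text": "Second example tweet",
        "user": {"screen_name": "example_user_b", "followers_count": 250},
        "entities": {"hashtags": []},
    },
]

# Nested dictionaries become dotted column names such as "user.screen_name";
# lists (here the hashtags) are kept as-is inside a single column.
tweets_df = pd.json_normalize(sample_tweets)

print(tweets_df.shape)
print(sorted(tweets_df.columns))
```

Each nested dictionary is unpacked into one column per key, which is why a real tweet ends up with several hundred columns.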
Also, here’s a quick view of our data:
To give a more meaningful identifier to each row than the currently used automatically generated row number, we’re going to use the unique tweet id column. We’re also going to drop the id_str column since it’s the string representation of the same tweet id and is thus redundant.
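A minimal sketch of this re-indexing, again on invented sample rows, could look as follows. The verify_integrity flag is our own addition: it makes pandas raise an error if the same tweet id appears twice, which is a cheap sanity check but not something the steps above require.

```python
import pandas as pd

# Invented sample rows standing in for the flattened search results.
tweets_df = pd.DataFrame(
    {
        "id": [101, 102],
        "id_str": ["101", "102"],
        "text": ["First example tweet", "Second example tweet"],
    }
)

# Use the tweet id as the row label and drop its redundant string copy.
tweets_df = tweets_df.set_index("id", verify_integrity=True).drop(columns="id_str")

print(tweets_df)
```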
Also, let’s look at how much data we’ve got. The DataFrame’s shape attribute returns a tuple of (number of rows (tweets), number of columns (features/variables)). So we’re currently looking at 18,000 tweets with 326 attributes we could analyze.
A view of a subset of columns would be useful here. Let’s look at the date the tweet was created, the user screen name and the text of the tweet. These are only 3 of the several hundred columns available.
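Selecting those three columns might look like the sketch below. The rows are invented, and the column names created_at, user.screen_name and text are an assumption based on how json_normalize names the fields of a tweet; the actual notebook may differ.

```python
import pandas as pd

# Invented sample rows; a real result set has several hundred columns.
tweets_df = pd.DataFrame(
    {
        "id": [101, 102],
        "created_at": ["Mon Sep 02 10:00:00 +0000 2019", "Mon Sep 02 10:05:00 +0000 2019"],
        "user.screen_name": ["example_user_a", "example_user_b"],
        "text": ["First example tweet", "Second example tweet"],
        "retweet_count": [0, 3],
    }
).set_index("id")

# Passing a list of column names returns a new DataFrame with just those columns.
columns_of_interest = ["created_at", "user.screen_name", "text"]
preview = tweets_df[columns_of_interest].head()

print(preview)
```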
In this first post of the series we’ve set up our Twitter developer credentials, used the TwitterSearch Python package to extract tweets about Brexit and used the pandas library to flatten (unpack) our results.
In the upcoming articles in this series, we’ll do further transformations of the data we’ve extracted and touch on the notions of Natural Language Processing and Sentiment Analysis.