Twitter data (the endless stream of tweets, the user network, and the rise and fall of hashtags) offers a flood of insight into the minute-by-minute state of society, or at least of one self-selecting part of it. A lot of people want to use it for research, and it turns out to be pretty easy to do so.
You can either purchase Twitter data or collect it in real time. Purchased data comes fully organized and reaches back historically, but it contains essentially nothing that you can't get yourself by monitoring Twitter in real time. I've used Gnip, where the going rate was about $500 per million tweets in 2013.
There are two main ways to collect data directly from Twitter: "queries" and the "stream". A query returns up to 1000 tweets at any point in time: the most recent tweets that match your search criteria. The stream continuously delivers a fraction of a percent of all tweets, filtered by criteria you set, and it adds up very quickly.
Scripts for both options are below, but you first need to decide on your search/streaming criteria. Typically, these are search terms and geographical constraints. See Twitter's API documentation to decide on your search options.
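For example, a query that combines an exact phrase with a geographic constraint might look like the following sketch. It assumes the authenticated api object constructed in the script further below; q and geocode are parameters of Twitter's v1.1 search API, and the coordinates and radius here are just illustrative:

# Sketch: a search combining an exact phrase with a geographic constraint.
# Assumes `api` is the authenticated twitter.Twitter object built below.
tweets = api.search.tweets(q='"risky business"',         # exact phrase
                           geocode="40.71,-74.00,25mi",  # lat,long,radius (illustrative point)
                           count=100)                    # tweets per request (the API caps this at 100)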
Twitter uses an authentication system to identify both the individual collecting the data and the tool that is helping them do it. It is easy to register a new tool, whereby you pretend that you're a startup with a great new app. Here are the steps:
- Install Python's twitter package, using "easy_install twitter" or "pip install twitter".
- Create an app at https://apps.twitter.com/. Leave the callback URL blank, but fill in the rest.
- Set the CONSUMER_KEY and CONSUMER_SECRET in the code below to the values you get on the keys and access tokens tab of your app.
- Fill in the name of the application.
- Fill in any search terms or structured searches you like.
- If you're using the downloaded scripts, which output data to a CSV file, change where the file is written to a directory of your choosing (where it says "twitter/us_").
- Run the script from your computer's terminal (e.g., python search.py).
- The script will pop up a browser for you to log into Twitter and accept the permissions for your app.
- Get data.
Here is what a simple script looks like:
import os, twitter

APP_NAME = "Your app name"
CONSUMER_KEY = 'Your consumer key'
CONSUMER_SECRET = 'Your consumer token'

# Do we already have a token saved?
MY_TWITTER_CREDS = os.path.expanduser('~/.class_credentials')
if not os.path.exists(MY_TWITTER_CREDS):
    # This will ask you to accept the permissions and save the token
    twitter.oauth_dance(APP_NAME, CONSUMER_KEY, CONSUMER_SECRET,
                        MY_TWITTER_CREDS)

# Read the token
oauth_token, oauth_secret = twitter.read_token_file(MY_TWITTER_CREDS)

# Open up an API object, with the OAuth token
api = twitter.Twitter(
    api_version="1.1",
    auth=twitter.OAuth(oauth_token, oauth_secret,
                       CONSUMER_KEY, CONSUMER_SECRET))

# Perform our query
tweets = api.search.tweets(q="risky business")

# Print the first result
for tweet in tweets['statuses']:
    if 'text' not in tweet:
        continue
    print(tweet)
    break
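The streaming side looks similar. Here is a minimal sketch using the same package's TwitterStream class, assuming the credentials file saved by the script above; the downloadable filter.py presumably adds CSV output and error handling on top of something like this:

import os, twitter

CONSUMER_KEY = 'Your consumer key'
CONSUMER_SECRET = 'Your consumer token'

# Reuse the token saved by the search script above
oauth_token, oauth_secret = twitter.read_token_file(
    os.path.expanduser('~/.class_credentials'))

# Open a streaming connection instead of a REST connection
stream = twitter.TwitterStream(
    auth=twitter.OAuth(oauth_token, oauth_secret,
                       CONSUMER_KEY, CONSUMER_SECRET))

# statuses/filter yields matching tweets as they arrive, indefinitely
for tweet in stream.statuses.filter(track="risky business"):
    if 'text' in tweet:
        print(tweet['text'])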
For automating Twitter collection, I've put together scripts for queries (search.py), streaming (filter.py), and bash scripts that run them repeatedly (repsearch.sh and repfilter.sh). Download the scripts.
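I haven't reproduced those scripts here, but a repetition wrapper amounts to a loop along these lines (the interval and the assumption that search.py names its own output files are illustrative, not necessarily what repsearch.sh does):

#!/bin/bash
# Illustrative repetition loop: rerun the query script indefinitely,
# pausing between runs.
while true; do
    python search.py    # assumed: search.py writes its own timestamped CSV
    sleep 900           # wait 15 minutes between queries (interval is illustrative)
done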
To use the repetition scripts, make them executable by running "chmod a+x repsearch.sh repfilter.sh". Then run them by typing ./repfilter.sh or ./repsearch.sh. Note that these will create many, many files over time, which you'll have to merge together; a sketch of one way to do that follows.
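Merging is a few lines of Python. This sketch assumes the batch files are CSVs that share a header row and sit under a common prefix (both assumptions; adjust to your setup), and it drops exact duplicate rows, since overlapping queries re-collect recent tweets:

import csv, glob

# Assumed layout: batch files like twitter/us_*.csv, all with the same header.
seen = set()
writer = None
with open('twitter/merged.csv', 'w') as out:
    for path in sorted(glob.glob('twitter/us_*.csv')):
        with open(path) as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:       # skip empty files
                continue
            if writer is None:       # write the header only once
                writer = csv.writer(out)
                writer.writerow(header)
            for row in reader:
                key = tuple(row)     # overlapping queries re-collect tweets,
                if key not in seen:  # so skip exact duplicates
                    seen.add(key)
                    writer.writerow(row)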