Twitter API - Scraping Millions of Tweets for Free!

Tools used: Python 3, Tweepy and SQLite3


Abstract

Twitter's API provides users with multiple endpoints for scraping and interacting with all parts of the service. In this project, we use the free search endpoint, driven by a simple but powerful Python script, to gather millions of tweets for all our data science needs.

There are two main steps for this process.

  • Getting the GEO IDs for each location
  • Scraping Tweets for each GEO ID

Requirements

The only requirements are Python 3, Tweepy and a database management system of your choice.

  • pip install tweepy

You will also need to get access to Twitter's developer API. There are 4 keys required.

Go to Twitter's Developer Dashboard and create a new project. You will then be able to generate the Authentication Handler keys and Access Tokens (4 codes in total).

Save these as constants in your script:

TWITTER_AUTH_KEY1 = "lZLDCEeGSDGDSUCoc"
TWITTER_AUTH_SECRET1 = "BNkhUyDSGGDSn8UUD40IFSDFDSevpStBcQazGDSGSDDgDd1LuUF"
TWITTER_ACCESS_KEY1 = "14962050-eEvJo7YnO7n0unCSDFSvDFFvj6S5kAabObTFyN2CEEj"
TWITTER_ACCESS_SECRET1 = "FDSFDSBhdJ7GDSDGimnuP3FSDDFSujl55We"

You will need to set up a few Twitter accounts to get multiple sets of API keys. This allows the script to run at full speed without being held back by the per-account request limits (a sketch of repeating the client setup per account follows the example below).

Now we can pass our keys into the tweepy.API constructor and change a few parameters:

import tweepy

auth = tweepy.OAuthHandler(TWITTER_AUTH_KEY1, TWITTER_AUTH_SECRET1)
auth.set_access_token(TWITTER_ACCESS_KEY1, TWITTER_ACCESS_SECRET1)
api1 = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True,
                  retry_count=3, retry_delay=5, retry_errors=set([401, 404, 500, 503]))
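
With more than one developer account, the same setup is simply repeated for each set of keys. The sketch below (ACCOUNT_KEYS and build_apis are illustrative names, not part of the original script) collects the resulting clients, which become the api1...api6 gathered into api_list later on; it reuses the tweepy import and key constants from above.

ACCOUNT_KEYS = [
    (TWITTER_AUTH_KEY1, TWITTER_AUTH_SECRET1, TWITTER_ACCESS_KEY1, TWITTER_ACCESS_SECRET1),
    # ... one tuple of four keys per Twitter developer account
]

def build_apis(account_keys):
    # Build one authenticated Tweepy client per account so the script can
    # rotate between them when rate limits are hit.
    apis = []
    for auth_key, auth_secret, access_key, access_secret in account_keys:
        auth = tweepy.OAuthHandler(auth_key, auth_secret)
        auth.set_access_token(access_key, access_secret)
        apis.append(tweepy.API(auth, wait_on_rate_limit=True,
                               wait_on_rate_limit_notify=True,
                               retry_count=3, retry_delay=5,
                               retry_errors=set([401, 404, 500, 503])))
    return apis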

Getting GEO IDs For Our Locations

We want to scrape tweets in specific locations (countries, cities, towns, etc.). To do this, we first need to look up the GEO ID for each location, using the geo_search method in Tweepy.

Here, we import a list of city names from a CSV and loop through them. The GEO ID is at the 0th index of the data returned. Save these and export them (see the export sketch after the loop below).

import csv

# Load the location names (one per row, after a header row) from a CSV
with open('list_of_locations.csv', newline='') as f:
    reader = csv.reader(f)
    list_of_locations = list(reader)

all_data = []
for loc, in list_of_locations[1:]:  # skip the header row
    data = api_list[current_api].geo_search(query=loc)
    all_data.append((loc, data[0].id))
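
To save and export the results, one simple option is to write the (location, GEO ID) pairs back out to a CSV. This is a minimal sketch, reusing the csv import above, with the output filename as an assumption:

# Write the scraped (location, GEO ID) pairs out for the scraping step to read
with open('location_geo_ids.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['location', 'location_id'])  # header row
    writer.writerows(all_data)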

The geo_search method has a limit of 15 requests every 15 minutes. This is where our multiple-account setup comes into play: we simply switch to the next API key every 15 iterations of the loop (see the sketch below). This could also be replaced with error handling, using Twitter's/Tweepy's rate-limit notifications, to help keep things future-proof.

api_list = [api1, api2, api3, api4, api5, api6]
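
Here is a minimal sketch of that rotation, restating the geo_search loop above with the key switch added (the modulo arithmetic is just one way to advance through api_list every 15 requests):

all_data = []
for index, (loc,) in enumerate(list_of_locations[1:]):  # skip the header row
    current_api = (index // 15) % len(api_list)  # next key every 15 requests
    data = api_list[current_api].geo_search(query=loc)
    if data:
        all_data.append((loc, data[0].id))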

Scraping Tweets

Being a free endpoint, there are naturally limits on the number of tweets available. The free search API gives a sample of the most recent tweets per location, which updates every 15-30 minutes. With this in mind, we can simply stagger our requests by 15 minutes and leave the script running for a few days, giving us the largest possible batch of tweets.

Here, we simply use the search method in Tweepy, passing in the place ID (GEO ID) and the count (number of tweets, limited to 100 per request).

while True:
    for index, (location, location_id) in enumerate(location_ids[1:]):  # skip the header row
        all_data = []
        tweets = api_list[current_api].search(q="place:%s" % location_id, count=100)
        api_limit_count += 1  # used to decide when to rotate to the next API key

        print(f"{index} / {len(location_ids)}, API: {current_api + 1}, {location}, {location_id}, {len(tweets)}")

        for tweet in tweets:
            all_data.append((location, location_id, tweet.created_at, tweet.text))

        SQL.insert_data(all_data)
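
The api_limit_count counter incremented above can drive the same kind of rotation for the search endpoint. The sketch below is illustrative; the threshold of 180 requests per 15-minute window is an assumption about the standard search limit, so adjust it to whatever your access level allows.

SEARCH_LIMIT = 180  # assumed requests-per-15-minutes budget for one key

# Inside the location loop, after each search call:
if api_limit_count >= SEARCH_LIMIT:
    current_api = (current_api + 1) % len(api_list)  # move to the next client
    api_limit_count = 0                              # reset the counter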

The following response fields are available for each tweet object:

  • .text - The contents of the tweet
  • .created_at - Creation timestamp of the tweet
  • .author_id - Unique ID of the account that created the tweet
  • .attachments - Any attachments to the tweet
  • .geo - Data on any location details tagged by the user
  • .lang - Language of the tweet
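
For example, a few of these fields can be read straight off each returned tweet object, as in the sketch below (attribute availability can differ between Tweepy and API versions, so treat this as illustrative):

for tweet in tweets:
    # Basic fields used in this project; .geo is only populated when the
    # user tagged a location on the tweet.
    print(tweet.created_at, tweet.lang)
    print(tweet.text)
    if tweet.geo is not None:
        print("Tagged location:", tweet.geo)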

Once all locations have been scraped, simply add a time.sleep for whatever remains of the 15-minute window.
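
A minimal sketch of that pause, assuming a 15-minute window and using time.time() to work out how much of it is left after a full pass over the locations:

import time

WINDOW_SECONDS = 15 * 60  # the 15-minute refresh window described above

while True:
    window_start = time.time()
    # ... scrape every location as shown above ...
    elapsed = time.time() - window_start
    if elapsed < WINDOW_SECONDS:
        time.sleep(WINDOW_SECONDS - elapsed)  # wait out the rest of the window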

Visit Twitter's Official Documentation for the full list of response fields and the other endpoints of the Twitter API.

To store our tweets, we can create a simple SQLite3 database:

import sqlite3

# Create the tweet_data_table in tweets.db if it does not already exist
def create_data_table():
    conn = sqlite3.connect('tweets.db')
    c = conn.cursor()

    c.execute("""CREATE TABLE IF NOT EXISTS tweet_data_table(id integer primary key, 
    location TEXT, location_id TEXT, tweet_date TEXT, tweet_text TEXT
    )""")

    c.close()
    conn.close()

# Insert a batch of (location, location_id, tweet_date, tweet_text) tuples
def insert_data(insert_data):
    conn = sqlite3.connect('tweets.db')
    c = conn.cursor()

    insert_statement = """INSERT INTO tweet_data_table
    (location, location_id, tweet_date, tweet_text) 
    VALUES (?, ?, ?, ?)"""

    c.executemany(insert_statement, insert_data)
    conn.commit()

    c.close()
    conn.close()

# Return every row currently stored in tweet_data_table
def select_data():
    conn = sqlite3.connect('tweets.db')
    c = conn.cursor()

    c.execute("SELECT * FROM tweet_data_table")
    rows = c.fetchall() 

    c.close()
    conn.close()

    return rows

Remember that SQL.insert_data() takes a list of tuples (one tuple per tweet), rather than a single flat tuple.
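
For example, assuming the functions above live in a module called SQL (the module name and the sample values below are placeholders), a minimal end-to-end usage looks like this:

import SQL  # module containing create_data_table / insert_data / select_data

SQL.create_data_table()

# insert_data expects a list of (location, location_id, tweet_date, tweet_text)
# tuples - one tuple per tweet - not a single flat tuple.
SQL.insert_data([
    ("London", "example_geo_id", "2021-01-01 12:00:00", "Example tweet text"),
])

rows = SQL.select_data()
print(len(rows), "tweets stored")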