Pete Hinchley: Archive Your Twitter Timeline into a Sqlite Database using PHP

This article is based on a deprecated version of the Twitter API.

In this article I will describe how you can archive your Twitter timeline into a sqlite database using PHP.

You can view the working code on Github.

Before using the code, you will need to create a Twitter application. This straightforward process is necessary to generate the tokens required for authenticating to the Twitter REST API. In particular, you will need to generate a consumer key and secret, and an OATH token and secret. Having generated these values, you must insert them into the code under the comment: define api keys.

The code calls the REST API to retrieve a collection of tweets. Each request is limited to 200 records, with an overall limit of 3,200 records. If you want to get older tweets, you will need to leverage an additional step, such as merging the data retrieved via the API with the older tweets obtained via a manually downloaded Twitter archive.

A good chunk of the code is used to build an OAuth token necessary to authenticate to the REST API. The remainder of the code parses the JSON response, and adds each tweet to a sqlite database (the database is created automatically when the script is first executed).

In addition to storing the full JSON representation of each tweet, I have also pulled out a subset of properties into separate database fields. This includes the tweet's unique ID, full text content, and creation timestamp. I also store a flag indicating if the tweet was a reply or a retweet, and if so, the unique ID of both the corresponding tweet and its author.

The code processes the Twitter timeline using an approach named cursoring. The official documentation provides a detailed explanation of this technique, but at a minimum, you should know that the timeline is processed in reverse chronological order (from newest to oldest), and the tweets retrieved with each request are governed by two key values: max_id and since_id. The max_id value identifies the ID of the oldest tweet you have processed, and since_id identifies the ID of the newest tweet.

To clarify the use of these variables, let's say you are reading a timeline of 2,000 tweets in chunks of 200. The first request will be made without the max_id value. This request will return the most recent 200 tweets from your timeline. The second request will be made with max_id set to the ID of the oldest tweet from the previous request; in effect, you are saying, get me the next 200 tweets, starting with the first tweet older than the tweet identified by max_id. This process repeats until the entire timeline is processed.

The since_id value is useful when you want to run the code multiple times (e.g. once per day to pick up new tweets). By setting since_id to the ID of the most recent tweet, the code will ensure it only reads tweets with higher IDs (i.e. more recent tweets). In this way, you avoid processing your entire timeline with each invocation of the script (and you also avoid adding the same tweets to the database over and over again).

Having archived your Twitter timeline, it's easy to work with the data. For example to display the text of a random tweet:

$d = new PDO('sqlite:./db.sqlite3');
$q = 'SELECT text, timestamp, isreply, isretweet FROM tweets ORDER BY RANDOM() LIMIT 1';
$t = $d->query($q)->fetch();
echo $t['text'];

And to display the most recent tweet, just change the query statement to the following:

$q = 'SELECT text, timestamp, isreply, isretweet FROM tweets ORDER BY guid + 0 DESC LIMIT 1';

It is also easy to parse the raw text of each tweet, converting links and Twitter usernames into clickable hyperlinks. For example:

function clickable($s) {
  $s = preg_replace('@(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@', '<a href="$1" target="_blank">$1</a>', $s);
  return preg_replace('#@(\w{1,15})\b#', '<a href="$1" target="_blank">@$1</a>', $s);

echo clickable($t['text']);

A couple of other points on the solution: