Tweet Dump

So one day I was browsing the API of Twitter. And on this particular page on Twitter's website, they talked using a command line utility to interact with Twitter. I was intrigued by this, so I decided to write up a simple script to deposit the tweets and twitter ID's into a database. Why? Because I could. ;)

So if for whatever reason you have to recreate my little experiment, you'll need a couple of things:


    1) A unix shell
      1.1) grep

    2)python installed and working
      2.1) the mysqldb python module installed

    3)mysql database installed

So here is my MySQL schema:

  1. -- MySQL dump 10.11
  2. --
  3. -- Host: localhost Database: twit_bank
  4. -- ------------------------------------------------------
  5. -- Server version 5.0.84-log
  6.  
  7. /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
  8. /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
  9. /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
  10. /*!40101 SET NAMES utf8 */;
  11. /*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
  12. /*!40103 SET TIME_ZONE='+00:00' */;
  13. /*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
  14. /*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
  15. /*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
  16. /*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
  17.  
  18. --
  19. -- Table structure for table `ids`
  20. --
  21.  
  22. DROP TABLE IF EXISTS `ids`;
  23. /*!40101 SET @saved_cs_client = @@character_set_client */;
  24. /*!40101 SET character_set_client = utf8 */;
  25. CREATE TABLE `ids` (
  26. `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  27. `twit_id` BIGINT(20) DEFAULT NULL,
  28. UNIQUE KEY `id` (`id`)
  29. ) ENGINE=MyISAM AUTO_INCREMENT=50250 DEFAULT CHARSET=latin1;
  30. /*!40101 SET character_set_client = @saved_cs_client */;
  31.  
  32. --
  33. -- Table structure for table `tweets`
  34. --
  35.  
  36. DROP TABLE IF EXISTS `tweets`;
  37. /*!40101 SET @saved_cs_client = @@character_set_client */;
  38. /*!40101 SET character_set_client = utf8 */;
  39. CREATE TABLE `tweets` (
  40. `ids` INT(10) UNSIGNED DEFAULT NULL,
  41. `tweet` CHAR(150) DEFAULT NULL
  42. ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  43. /*!40101 SET character_set_client = @saved_cs_client */;
  44. /*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;
  45.  
  46. /*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
  47. /*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
  48. /*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
  49. /*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
  50. /*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
  51. /*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
  52. /*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
  53.  
  54. -- Dump completed on 2009-10-19 0:22:26

And once you have that schema install, you can use the script below (Obviously you'll need to change the username and password to your particular database, but you don't need me to tell you that. My readers are S-M-A-R-T!)

  1. #!/usr/bin/python
  2.  
  3. import os
  4. import subprocess
  5. import MySQLdb
  6. import re
  7. import sys
  8.  
  9. #variables
  10. databaseUser = "*****"
  11. databasePassword = "*****"
  12. databaseName = "twit_bank"
  13.  
  14.  
  15. def runQuery(query):
  16.  
  17. db = MySQLdb.connect(user=databaseUser, passwd=databasePassword, db=databaseName)
  18. c = db.cursor()
  19. c.execute(query)
  20. db.commit()
  21.  
  22. return c.fetchall()
  23.  
  24. def getId(in_id):
  25. sql = "select id from ids where twit_id = '" + in_id + "'"
  26. return runQuery(sql)
  27.  
  28. def addId(in_id):
  29. sql= "insert into ids values (null, '" + in_id + "')"
  30. runQuery(sql)
  31.  
  32. def addTweet(in_id, in_tweet):
  33. sql = "insert into tweets values ('" + in_id + "',\"" + in_tweet + "\")"
  34. runQuery(sql)
  35.  
  36. def main():
  37.  
  38. final_texts = list()
  39. j = 0
  40. site_id = list()
  41.  
  42. #code structure grabbed from:
  43. #http://buffis.com/2007/01/09/getting-output-of-a-system-call-in-python/comment-page-1/
  44.  
  45. command = "curl <a href="http://twitter.com/statuses/public_timeline.xml" title="http://twitter.com/statuses/public_timeline.xml">http://twitter.com/statuses/public_timeline.xml</a> | grep -A 3 '<status>' "
  46. process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
  47. os.waitpid(process.pid, 0)
  48. curl = process.stdout.read().strip()
  49. sys.stdout.flush()
  50.  
  51. ids = re.findall("[0-9]{10,13}", curl)
  52. texts = re.findall("<text>[\x20-\x7E]{1,150}", curl)
  53.  
  54. if ( len(ids) == len(texts)):
  55. for i in texts:
  56. temp = i.lstrip('<text>')
  57. final_texts.append(temp.rstrip('</text>'))
  58.  
  59. while( j < 19):
  60. site_id = getId(ids[j])
  61. if( not site_id):
  62. addId(ids[j])
  63. site_id = getId(ids[j])
  64.  
  65. addTweet(str(site_id[0][0]), final_texts[j])
  66. j = j + 1
  67.  
  68. main()

And this final last piece to put it all together, so you don't have to do all the grabbing manually.(Don't forget to change the location of the script.)

  1. #!/bin/bash
  2.  
  3. while [ 1 ]
  4. do
  5. /home/bryce/programming/personal/scripts/python/twit2mysql.py
  6. sleep 30
  7. done

Now we get to the fun part. After about 24 hours of grabbing the public timeline at 40 tweets a minute I got 50249 ID's and 50234 tweets on record. Now while watching the numbers grow, there was a weird issue. Most of the time there were more ID's than tweets. I'm not sure why this is at the moment, but I'm curious enough to try and solve this bug.

Oh, and something that people should probably know. The reason why I'm only grabbing 40 tweets a minute is because the first time I tried to run this, I was grabbing more like 120 tweets a minute. And I was blocked by Twitter for viewing the public timeline more than 150 times in one hour. And I was blocked for the rest of the hour. So consider that your warning.

Great post!

Great! I followed this and it really is cool! Thanks for posting. I am glad that you have warned us about the number of tweets per minute. If not, I have been blocked by twitter as well. Till your next post!

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.

Bookmark and Share