Tweet Dump

So one day I was browsing Twitter's API documentation, and on one particular page they talked about using a command line utility to interact with Twitter. I was intrigued by this, so I decided to write up a simple script to deposit tweets and Twitter IDs into a database. Why? Because I could. ;)

So if for whatever reason you want to recreate my little experiment, you'll need a few things:


    1) A Unix shell
       1.1) grep

    2) Python installed and working
       2.1) the MySQLdb Python module installed

    3) MySQL database installed

So here is my MySQL schema:

  -- MySQL dump 10.11
  --
  -- Host: localhost    Database: twit_bank
  -- ------------------------------------------------------
  -- Server version    5.0.84-log

  /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
  /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
  /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
  /*!40101 SET NAMES utf8 */;
  /*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
  /*!40103 SET TIME_ZONE='+00:00' */;
  /*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
  /*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
  /*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
  /*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

  --
  -- Table structure for table `ids`
  --

  DROP TABLE IF EXISTS `ids`;
  /*!40101 SET @saved_cs_client = @@character_set_client */;
  /*!40101 SET character_set_client = utf8 */;
  CREATE TABLE `ids` (
    `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `twit_id` BIGINT(20) DEFAULT NULL,
    UNIQUE KEY `id` (`id`)
  ) ENGINE=MyISAM AUTO_INCREMENT=50250 DEFAULT CHARSET=latin1;
  /*!40101 SET character_set_client = @saved_cs_client */;

  --
  -- Table structure for table `tweets`
  --

  DROP TABLE IF EXISTS `tweets`;
  /*!40101 SET @saved_cs_client = @@character_set_client */;
  /*!40101 SET character_set_client = utf8 */;
  CREATE TABLE `tweets` (
    `ids` INT(10) UNSIGNED DEFAULT NULL,
    `tweet` CHAR(150) DEFAULT NULL
  ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  /*!40101 SET character_set_client = @saved_cs_client */;
  /*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;

  /*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
  /*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
  /*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
  /*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
  /*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
  /*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
  /*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

  -- Dump completed on 2009-10-19  0:22:26

And once you have that schema installed, you can use the script below. (Obviously you'll need to change the username and password to match your particular database, but you don't need me to tell you that. My readers are S-M-A-R-T!)

  #!/usr/bin/python

  import os
  import subprocess
  import MySQLdb
  import re
  import sys

  # variables
  databaseUser = "*****"
  databasePassword = "*****"
  databaseName = "twit_bank"


  def runQuery(query):
      # Open a connection, run the query, commit, and return any rows it produced.
      db = MySQLdb.connect(user=databaseUser, passwd=databasePassword, db=databaseName)
      c = db.cursor()
      c.execute(query)
      db.commit()

      return c.fetchall()

  def getId(in_id):
      # Look up the internal id for a given Twitter status id.
      sql = "select id from ids where twit_id = '" + in_id + "'"
      return runQuery(sql)

  def addId(in_id):
      sql = "insert into ids values (null, '" + in_id + "')"
      runQuery(sql)

  def addTweet(in_id, in_tweet):
      sql = "insert into tweets values ('" + in_id + "',\"" + in_tweet + "\")"
      runQuery(sql)

  def main():

      final_texts = list()
      j = 0
      site_id = list()

      # code structure grabbed from:
      # http://buffis.com/2007/01/09/getting-output-of-a-system-call-in-python/comment-page-1/

      command = "curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' "
      process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
      os.waitpid(process.pid, 0)
      curl = process.stdout.read().strip()
      sys.stdout.flush()

      ids = re.findall("[0-9]{10,13}", curl)
      texts = re.findall("<text>[\x20-\x7E]{1,150}", curl)

      if (len(ids) == len(texts)):
          # Strip the surrounding <text>...</text> tags from each match.
          for i in texts:
              temp = i[len("<text>"):]
              final_texts.append(temp.split("</text>")[0])

          while (j < 19):
              # Insert the Twitter id if we haven't seen it before, then store the
              # tweet against the internal id from the ids table.
              site_id = getId(ids[j])
              if (not site_id):
                  addId(ids[j])
                  site_id = getId(ids[j])

              addTweet(str(site_id[0][0]), final_texts[j])
              j = j + 1

  main()

And this is the final piece to put it all together, so you don't have to do all the grabbing manually. (Don't forget to change the location of the script to match your own.)

  #!/bin/bash

  while [ 1 ]
  do
      /home/bryce/programming/personal/scripts/python/twit2mysql.py
      sleep 30
  done

Now we get to the fun part. After about 24 hours of grabbing the public timeline at 40 tweets a minute, I had 50249 IDs and 50234 tweets on record. While watching the numbers grow, I noticed a weird issue: most of the time there were more IDs than tweets. I'm not sure why that is at the moment, but I'm curious enough to try and solve this bug.
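My best guess so far (and it is only a guess): the INSERT statements are built by gluing strings together, so a tweet containing a quote character could make its own INSERT fail after the matching ID has already been stored, leaving the two tables out of step. Here is a minimal sketch of the same helpers rewritten with MySQLdb's parameterized queries, which let the driver handle the escaping; the names mirror the script above:

  import MySQLdb

  databaseUser = "*****"
  databasePassword = "*****"
  databaseName = "twit_bank"

  def runQuery(query, params=()):
      # The driver escapes the values, so quotes inside a tweet can't break the SQL.
      db = MySQLdb.connect(user=databaseUser, passwd=databasePassword, db=databaseName)
      c = db.cursor()
      c.execute(query, params)
      db.commit()
      return c.fetchall()

  def addId(in_id):
      runQuery("insert into ids values (null, %s)", (in_id,))

  def addTweet(in_id, in_tweet):
      runQuery("insert into tweets values (%s, %s)", (in_id, in_tweet))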

Oh, and something people should probably know: the reason I'm only grabbing 40 tweets a minute is that the first time I ran this, I was grabbing more like 120 tweets a minute, and Twitter blocked me for viewing the public timeline more than 150 times in one hour. The block lasted for the rest of the hour. So consider that your warning.
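If you'd rather keep the pacing logic in Python instead of bash, a minimal sketch along these lines does the same job as the wrapper above; the 150-per-hour figure is just the cap I ran into, and the spacing works out to the same 30 seconds:

  import subprocess
  import time

  REQUESTS_PER_HOUR_CAP = 150                                  # the hourly limit I got blocked at
  SECONDS_BETWEEN_GRABS = 3600 / REQUESTS_PER_HOUR_CAP + 6     # 24s minimum spacing, padded to 30s

  while True:
      # Same job as the bash wrapper: run the grabber, then wait before the next poll.
      subprocess.call(["/home/bryce/programming/personal/scripts/python/twit2mysql.py"])
      time.sleep(SECONDS_BETWEEN_GRABS)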

My First Python Script

I would like to introduce everyone to my first Python script. Just for the sake of full disclosure, it was originally put together by a coworker and then passed off to me for completion. Until recently I hadn't had much interest in even touching Python. I guess we should all be so lucky to have a job that forces us to grow in unexpected directions every once in a while.

The script itself is nothing overly complicated, but I'm still proud of it. The reason is that I had to dissect the language and its semantics to figure out what functionality still needed to be added to the script, as well as learn to read the language in order to work with the parts that existed before I touched it.

Just a quick overview of what it actually does (or is at least supposed to do ;) ): somewhere in the world there is a MySQL database holding a schema that was created to allow Apache virtualhost config files to be regenerated dynamically at the push of a button. (I'm being a little terse because I really enjoy the project and have an upcoming blog post with more details, so I don't want to spoil all the fun now. :) )

  # -*- coding: utf-8 -*-
  import commands
  import datetime
  import unittest
  import MySQLdb
  import sys

  ###################### Constants #############################

  # What user should we connect to the database as, for updating tables and such?
  databaseUser = "******"

  # What is the password to the database for the given user?
  databasePassword = "******"

  # What is the name of the database that we are connected to?
  databaseName = "apache_config"

  ##################### Begin Code Implementation ###################

  # Returns a set containing all tuples satisfying the given query.
  def runQuery(query):
      db = MySQLdb.connect(user=databaseUser, passwd=databasePassword, db=databaseName)
      #db = MySQLdb.connect(host=host, user=databaseUser, passwd=databasePassword, db=databaseName)

      # Execute the query and return the data.
      c = db.cursor()
      c.execute(query)
      return c.fetchall()

  # Retrieves the master table row that holds information on the requested virtualhost.
  def retrieveMasterVirtualhostTable(in_virtualhost):
      return runQuery("select name, ipaddress from virtualhosts WHERE name = \"" + in_virtualhost + "\"")

  # Get all information for each site that belongs to the named virtualhost.
  def retrieveSitesbyVirtualhosts(in_virtualhost):
      return runQuery("SELECT virtualhosts.name, virtualhosts.ipaddress, site.site_name, site.docroot, site.admin_email, site.redirects, site.misc, site.site_id, site.misc FROM site,virtualhosts WHERE site.vhost_id = virtualhosts.id AND virtualhosts.name= \"" + in_virtualhost + "\"")

  # Retrieve all aliases for a particular site.
  def retrieveAliasesbysite_id(in_aliases):
      return runQuery("SELECT alias_name FROM site_aliases WHERE site_id= " + in_aliases)

  # Given a virtualhost entry, returns the header for the associated virtualhost.
  def formatVirtualhostHeader(virtualhost):
      return 'NameVirtualHost ' + virtualhost[1] + '\nLogFormat "%v %h %l %u %t \\\"%r\\\" %>s %b" vhost\n'

  # Given a site row, build the <VirtualHost> block for it.
  def convertWebsiteToVirtualhostEntry(site):
      if (not site[5]):
          # Regular site: no redirect configured.
          if (site[4]):
              ret = "\n<VirtualHost " + site[1] + ">\n" + \
                    " ServerAdmin " + "webmaster@" + site[2] + "\n" + \
                    " DocumentRoot " + site[3] + "\n" + \
                    " ServerName " + site[2] + "\n"
          else:
              ret = "\n<VirtualHost " + site[1] + ">\n" + \
                    " DocumentRoot " + site[3] + "\n" + \
                    " ServerName " + site[2] + "\n"

          # Add in site aliases.
          for alias in retrieveAliasesbysite_id(str(site[7])):
              ret += " ServerAlias " + alias[0] + "\n"

          ret += \
              " ErrorLog /usr/local/var/apache/log/" + site[0] + ".error_log\n" + \
              " CustomLog /usr/local/var/apache/log/" + site[0] + ".access_log vhost\n"

          if site[8]:
              ret += site[8] + "\n"

      else:
          # Redirect-only site.
          ret = "\n<VirtualHost " + site[1] + ">\n" + \
                " ServerName " + site[2] + "\n"

          for alias in retrieveAliasesbysite_id(str(site[7])):
              ret += " ServerAlias " + alias[0] + "\n"

          ret += " Redirect " + site[5] + "\n"

      ret += "</VirtualHost>"
      return ret

  # Execution begins here.
  def main():

      # Grab the virtualhost row matching the name given on the command line.
      virtualhost = retrieveMasterVirtualhostTable(sys.argv[1])

      # Print the format stuff out at the top of the file.
      print formatVirtualhostHeader(virtualhost[0])

      # Now grab sites based on the virtualhost called.
      virtualhostMaster = retrieveSitesbyVirtualhosts(virtualhost[0][0])

      # For each site in the virtualhostMaster "list"...
      for v in virtualhostMaster:

          # Print entire output to the screen.
          print convertWebsiteToVirtualhostEntry(v)

  main()

I would just like to point out that although I'm proud of my work, I'm not sharing this to try and receive praise from the internet. I'd probably have better luck getting an STD from the internet than praise for this little script. On the contrary, I'm posting this because I'm new to Python, and I would like someone with a little more experience to point out the places where I might not have used the best Python methods to get things accomplished. For example, the line:

  virtualhostMaster = retrieveSitesbyVirtualhosts(virtualhost[0][0])

Although this works, my intuition tells me that this isn't a very Pythonic way of getting that data, using a double index like that. However, I just haven't been able to find a better way to do it. So if someone out there knows one, I'm all eyes!
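One alternative that might read more clearly (just a sketch, and I'm not sure it's the most Pythonic option either): since fetchall() returns a tuple of row tuples, the first row can be unpacked into named variables inside main(), which at least gives the double index a name:

  # virtualhost is the result of retrieveMasterVirtualhostTable(), e.g. (('myhost', '10.0.0.1'),)
  name, ipaddress = virtualhost[0]    # unpack the first (and only) row
  virtualhostMaster = retrieveSitesbyVirtualhosts(name)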

Mistakes

For a tech blog, I feel I write a lot of stories. And even though I want to share what I've learned, I also feel that including a little bit of background adds a nice layer of context to the information I'm trying to pass along. I definitely don't want to write a novel, but I feel that a little backstory might add an extra element to what I'm trying to convey.

Like this story: a couple of months ago I saw an article posted by HowToGeek on video recording in VLC. After reading it I went home to try it out and had no luck; VLC didn't record for me. At that point I commented on the blog saying as much. (The date of my comment is July 22, 2009 at 10:19 pm.)

Here we are now, almost three months later, and I have to eat those words. VLC's record feature does work in Linux. But if you're using Gentoo, you need to make sure you have the right USE flags set or else it won't record.

I came across this while trying to figure out how to get x264 encoding to work in FFmpeg. Through some random Google result I found a page hosted on VLC's website that basically contained the "secret sauce" for building VLC properly on Gentoo so that it can record. So of course I followed the page's advice and added the line:

media-video/vlc wxwindows aac dts a52 ogg flac theora oggvorbis matroska freetype bidi xv svga gnutls stream vlm httpd cdda vcd cdio live

to my /etc/portage/package.use, then recompiled. And voilà, the record button now functions properly. I admit to feeling a little sheepish for posting my comment so quickly, and of course now I feel compelled to post again on that article telling them I was wrong. I also feel like I should post to the Gentoo forums, because I wasn't able to find anything there regarding this problem, and hopefully I can help someone out before it becomes a three-month problem for them.

Internet, I learned my lesson. From now on I will do my due diligence and research more thoroughly before I post comments. Can you forgive me? Thanks!

Ingenuity

So it's been a while since I last posted. Really intense summer school and a vacation will do that to a blog. However, in my head I haven't stopped writing, so I have a couple of posts that just need to get written down. Like this one. :)

Usually when I hear the word ingenuity, I think of the fictional character MacGyver, who, as the pop culture jokes go, could make a bomb out of a rubber band, a paper clip, and a button, or anything else that happened to be in front of him. But you don't have to be MacGyver to have a moment of ingenuity.

Back in August I rode my motorcycle down from Portland to San Francisco. The incident I'm about to describe happened around nine at night, on I-5 in southern Oregon. Just to give you an idea of the environment I was riding in: close your eyes while standing outside. That is how much light was actually on the highway. The only thing letting me see what was in front of me was the light from my bike, and since that is just a single motorcycle headlight, you don't get a lot of light to see with.

While riding along, an object suddenly appeared in my view: a dead deer, just lying in the center of the fast lane. At the speed I was traveling I was unable to stop safely, so I had to swerve. I tried, but I didn't succeed, and somehow I rode my motorcycle OVER the deer carcass. After using the deer as a jump ramp and landing, though, my bike started to make a noise. I thought at the time that it might be wheel or tire related, so I pulled over and tried to see what the damage was. But again, there were no outside lights, and I had forgotten to pack a flashlight (lucky me). Being around forty miles from my destination for the first leg of my trip, I decided I would just go really slowly and hope I would make it. So I put on my hazard lights and slowly rode down the shoulder. After a couple of miles I came across a rest stop, and the rest stop had street lights, so I pulled over under one of them to finally find out what was wrong with my bike. As soon as I got off and looked, I saw my problem: somehow when I went over the deer, part of it had knocked off the spring that kept my center stand up. That weird sound I had been hearing while the bike was moving was my center stand hitting the ground.

Being only about thirty miles away at this point, I needed to figure out a way to keep the center stand from hitting the ground, and all I had on me was clothes. So I needed to get creative. I looked around the rest stop, but those places are kept pretty clutter free, so I wasn't able to find anything I could use to hold the center stand up. After a while, a moment of ingenuity hit me. I dug through my bag, pulled out one of my socks, and tied one end to the center stand and the other to part of my motorcycle frame. Using that sock I was able to ride at normal speed to my destination, where I quickly went to sleep and temporarily forgot about the problem of how to continue my trip to San Francisco.

Just to tie up this loose end in the story: the next day I was able to acquire a bungee cord, which I used to keep the center stand pulled up against my bike. About a month later, that bungee cord is still keeping the center stand up, and it probably will until next riding season.

You might now be thinking, "Bryce, you write a tech blog. Why are you telling us this story?" And my immediate response is, "That is a good question." I would like to say that there is some kind of connection between ingenuity and technology. No examples come to mind at the moment, but I'm sure they exist, and they may even exist within your own experiences, ones you created yourself or at least observed. So maybe I'll just say that the reason I'm telling you this story is to remind you that anyone can have a moment of ingenuity; you just have to keep your wits about you in the middle of a problem and be willing to try new or different solutions to the problems you face, both technical and otherwise.

Of course, this raises the question: do any of you have examples off the top of your head where technology and ingenuity come together?
