
MySQL for Python

Author: Albert Lukaszewski
Publisher: Packt

MySQL for Python is a good book for any beginning-to-intermediate Python programmer who wants or needs a crash course in SQL programming and MySQL administration.

I base that opinion on the fact that I am comfortable with Python but do not code in it often at my job. I can also work my way around a MySQL database without having to research every SQL command; my knowledge of Python and MySQL is functional, but not expert.

The book's format is one I think a newcomer to MySQL would find helpful. Each chapter is dedicated to a certain set of MySQL tasks; for example, Chapter 7 is titled "Creating and Dropping" (as in tables). Each chapter opens with a quick discussion of what it is supposed to convey, then moves on to the MySQL commands used to perform the tasks at hand. After these commands are discussed, the book shows how to perform the same operations from Python. Each chapter ends with a project meant to challenge the reader to build something that exemplifies the chapter's topic.

One area where I find the book lacking is that there is no mention of Python 3, or of the changes one's code would need to be compatible with it. For example, a couple of code examples in the book rely heavily on the function xrange(). Not that xrange() is a bad function to use, but it simply doesn't exist in Python 3. I could theorize that the author didn't want to complicate the idea of using Python as a tool for learning SQL, but it does limit how long the content stays relevant.
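
To be clear, the fix is usually trivial; the snippet below (my own, not from the book) shows the common shim that lets xrange()-based code run under both Python 2 and Python 3:

try:
    xrange          # exists in Python 2
except NameError:
    xrange = range  # Python 3: xrange() is gone, and range() is already lazy

for i in xrange(3):
    print(i)

But a reader working only from the book wouldn't know to do this, which is my point.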

A part of the book that I did find quite good was the quick and easily understandable explanation of the various joins in Chapter 13. Although the descriptions were a little short, the brevity was acceptable because of the included Venn diagrams that help illustrate how the joins work logically. This was a great touch for helping someone learn these concepts; it was also the first time I've seen Venn diagrams used to teach joins.

I recognize I might have been overly harsh in this review, but that doesn't mean I wouldn't recommend the book to the right person. The right person, I feel, is someone who is already comfortable with Python and wants to expand their knowledge to include the basics of MySQL administration and SQL programming.

Tweet Dump part 3

Welcome back to another, and probably the last, instalment of the tweet dump project.

The old tweet dump code ran against a local database. I wanted to see how things would be affected if I pointed the tweet dump code at a remote server, just to get an idea of how the extra network travel would affect latency.

So I changed things up a bit: I created a MySQL database through my hosting provider and imported the schema into it, then modified the tweet dump code to use this new remote database. After running the Python script five times, here are my numbers:

Run 1: 43.998 s
Run 2: 46.352 s
Run 3: 45.029 s
Run 4: 55.024 s
Run 5: 49.174 s
Average: 47.92 s
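
In case you're curious how numbers like these can be collected, the sketch below wall-clocks one full run of the script from the outside; the script name is assumed from the original tweet dump post, and I'm not claiming this is exactly the harness I used:

import subprocess
import time

# Wall-clock one full run of the tweet dump script.
# The script name/path is an assumption; point it at wherever your copy lives.
start = time.time()
subprocess.call(["python", "twit2mysql.py"])
print("elapsed: %.3f seconds" % (time.time() - start))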

In my tweet dump code there is a lot of going back and forth between the client and the server:

def getId(in_id):
    sql = "select id from ids where twit_id = '" + in_id + "'"
    return runQuery(sql)

def addId(in_id):
    sql = "insert into ids values (null, '" + in_id + "')"
    runQuery(sql)

def addTweet(in_id, in_tweet):
    sql = "insert into tweets values ('" + in_id + "',\"" + in_tweet + "\")"
    runQuery(sql)

...

site_id = getId(ids[j])
if( not site_id):
    addId(ids[j])
    site_id = getId(ids[j])

Looking at the three functions and the snippet at the bottom of that block, you might notice that getting an id for a new tweet can take three separate SQL calls to the server (a lookup, an insert, and a second lookup). While that might be fine for a local database, it is not good for a remote one. So I decided to try my hand at creating a SQL function to reduce some of the back and forth between the two machines.

delimiter //

DROP FUNCTION IF EXISTS `get_set_ids`//
CREATE FUNCTION get_set_ids( in_tweet_id BIGINT )
RETURNS INT(10) UNSIGNED
BEGIN
    DECLARE new_id INT(10) UNSIGNED;
    SELECT id INTO new_id FROM ids WHERE twit_id = in_tweet_id;
    IF ISNULL(new_id) THEN
        INSERT INTO ids VALUES (NULL, in_tweet_id);
        SELECT id INTO new_id FROM ids WHERE twit_id = in_tweet_id;
    END IF;
    RETURN new_id;
END //

delimiter ;

The function above takes the guesswork away from the client and keeps it within the server. By doing this we avoid an entire round of communication between client and server, so in theory we should see at least some kind of speed-up.

With that function created, I then modified the script to take advantage of it:

def get_set_id(in_id):
    sql = "select get_set_ids(" + in_id + ")"
    return runQuery(sql)

...

site_id = get_set_id(ids[j])
addTweet(str(site_id[0][0]), final_texts[j])

After a bit of testing to make sure things worked, I ran five timed tests (on the same hardware and from the same location, to try and reduce any variables that might crop up).

Run 1: 25.948 s
Run 2: 24.35 s
Run 3: 26.181 s
Run 4: 24.667 s
Run 5: 25.352 s
Average: 25.3 s

The difference between the two averages is about 22.6 seconds. To be honest, I did not expect these changes to cut my times roughly in half; it was a bit of a shock. I guess in the end this is an example of how much design can really matter.

Tweet Dump pt 2: Statistics

Some of you might remember my post last week titled Tweet Dump. And some of you might recall that I didn't include any statistics regarding my findings. That was my bad, and I apologize. I'll do my best to not do that in the future. So without taking up any more of your time let's get right to the numbers.

So here is the SQL I used to check for the number of retweets:
 select count(*) from tweets where tweet like '%RT%';
which returned:
1791 out of 50234, or 3.56%.

The SQL for tweets to at least one person:
 select count(*) from tweets where tweet like '%@%';
which returned:
20984 or 41.78%

The SQL for tweets with http links:
 select count(*) from tweets where tweet like '%http://%';
returned:
9154 or 18.23%

This is the SQL I used to try and find anyone tweeting more than once:
 select ids, count (*) as cnt from tweets group by ids having cnt > 1;
And sadly I get 0. It's a little weird that I can capture almost fifty-one thousand tweets and not catch one person tweeting more than once. I want to try this again with a smaller window of time; maybe people are less likely to tweet more than once within a window like mine. I don't know, I'm just curious. It could also be something wrong with my SQL. Anyone see anything wrong with it?

And finally the SQL for hashtags:
 select count(*) from tweets where tweet like '%#%';
which returned 15117 or 30.09%

However, the SQL above also grabbed things like this:

"FACTのPVおちゃめで楽しい。これ見た後だと他"

Which is not incorrect per se, but I wasn't really wanting to include foreign-language characters in my search. So I'll need to come up with a more refined SQL query that filters tweets like this out of my count.
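
One possible refinement (just a sketch; the REGEXP pattern is a first guess at what should count as a hashtag, and the connection details mirror the tweet dump script) would be to only count a '#' when it is followed by ASCII word characters:

import MySQLdb

# Count only tweets where '#' is followed by ASCII letters, digits, or underscores.
# The pattern is a rough first cut, not a full definition of a real hashtag.
db = MySQLdb.connect(user="*****", passwd="*****", db="twit_bank")
c = db.cursor()
c.execute("select count(*) from tweets where tweet regexp '#[A-Za-z0-9_]+'")
print(c.fetchone()[0])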

Now some of you more detail-oriented readers might have noticed that the percentages only add up to 93.66%. The best I can offer is that maybe, just maybe, the remaining 6.34% are just regular tweets. Obviously I'm not doing a very in-depth analysis; there are surely tweets that are directed at someone and also include a link, which my numbers don't take into consideration. If you really want solid numbers, ask a statistician. I'm a programmer. :P

Again, I'm not quite done with this project. But I thought I would take the time to share my findings. I hope you found them as interesting as I did.

Tweet Dump

So one day I was browsing Twitter's API documentation, and on one particular page they talked about using a command-line utility to interact with Twitter. I was intrigued, so I decided to write up a simple script to deposit tweets and Twitter IDs into a database. Why? Because I could. ;)

So if for whatever reason you have to recreate my little experiment, you'll need a couple of things:


    1) A Unix shell
      1.1) grep

    2) Python installed and working
      2.1) the MySQLdb Python module installed

    3) A MySQL database installed

So here is my MySQL schema:

-- MySQL dump 10.11
--
-- Host: localhost Database: twit_bank
-- ------------------------------------------------------
-- Server version 5.0.84-log

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `ids`
--

DROP TABLE IF EXISTS `ids`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `ids` (
  `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `twit_id` BIGINT(20) DEFAULT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=50250 DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Table structure for table `tweets`
--

DROP TABLE IF EXISTS `tweets`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `tweets` (
  `ids` INT(10) UNSIGNED DEFAULT NULL,
  `tweet` CHAR(150) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

-- Dump completed on 2009-10-19 0:22:26

And once you have that schema installed, you can use the script below. (Obviously you'll need to change the username and password for your particular database, but you don't need me to tell you that. My readers are S-M-A-R-T!)

#!/usr/bin/python

import os
import subprocess
import MySQLdb
import re
import sys

#variables
databaseUser = "*****"
databasePassword = "*****"
databaseName = "twit_bank"


def runQuery(query):
    db = MySQLdb.connect(user=databaseUser, passwd=databasePassword, db=databaseName)
    c = db.cursor()
    c.execute(query)
    db.commit()

    return c.fetchall()

def getId(in_id):
    sql = "select id from ids where twit_id = '" + in_id + "'"
    return runQuery(sql)

def addId(in_id):
    sql = "insert into ids values (null, '" + in_id + "')"
    runQuery(sql)

def addTweet(in_id, in_tweet):
    sql = "insert into tweets values ('" + in_id + "',\"" + in_tweet + "\")"
    runQuery(sql)

def main():

    final_texts = list()
    j = 0
    site_id = list()

    #code structure grabbed from:
    #http://buffis.com/2007/01/09/getting-output-of-a-system-call-in-python/comment-page-1/

    command = "curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' "
    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    os.waitpid(process.pid, 0)
    curl = process.stdout.read().strip()
    sys.stdout.flush()

    ids = re.findall("[0-9]{10,13}", curl)
    texts = re.findall("<text>[\x20-\x7E]{1,150}", curl)

    if ( len(ids) == len(texts)):
        for i in texts:
            temp = i.lstrip('<text>')
            final_texts.append(temp.rstrip('</text>'))

        while( j < 19):
            site_id = getId(ids[j])
            if( not site_id):
                addId(ids[j])
                site_id = getId(ids[j])

            addTweet(str(site_id[0][0]), final_texts[j])
            j = j + 1

main()

And this final piece puts it all together, so you don't have to do all the grabbing manually. (Don't forget to change the location of the script.)

#!/bin/bash

while [ 1 ]
do
    /home/bryce/programming/personal/scripts/python/twit2mysql.py
    sleep 30
done

Now we get to the fun part. After about 24 hours of grabbing the public timeline at roughly 40 tweets a minute, I had 50249 IDs and 50234 tweets on record. While watching the numbers grow, I noticed a weird issue: most of the time there were more IDs than tweets. I'm not sure why that is at the moment, but I'm curious enough to try and track the bug down.
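
I haven't dug into it yet, but one starting point would be to count the ids rows that never picked up a matching tweets row. Below is a rough sketch; the connection details mirror the script above, and the join condition assumes tweets.ids stores ids.id, which is what the script does:

import MySQLdb

# How many ids rows have no corresponding row in tweets?
db = MySQLdb.connect(user="*****", passwd="*****", db="twit_bank")
c = db.cursor()
c.execute("select count(*) from ids left join tweets on ids.id = tweets.ids "
          "where tweets.ids is null")
print(c.fetchone()[0])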

Oh, and something people should probably know: the reason I'm only grabbing 40 tweets a minute is that the first time I ran this, I was grabbing more like 120 tweets a minute, and Twitter blocked me for the rest of the hour for requesting the public timeline more than 150 times in one hour. So consider that your warning.

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.

