CouchDB

Tweet Dump: Resurrection part 2

I don't know about you, but when I hear or read the same thing three or so times from random sources, I pay attention. And each of these comments has been about one thing: in the previous post in this series I shelled out to curl to fetch data from Twitter. Not only that, but I used regular expressions to parse XML. I won't deny it... I made some pretty bad mistakes. The only consolation I have is that I made them over a year and a half ago, when I was just starting to learn Python and wasn't aware of just how many libraries the standard install includes.

To help with the learning process, I'm going to show the original version as well as the “fixed” version.

Original:

def get_tweets():
    final_texts = []
    command = "curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' "
    process = subprocess.Popen(command, stdout=subprocess.PIPE, error=subprocess.PIPE, shell=True)
    (proc, error) = process.communicate()
    sys.stdout.flush()

    if error:
        print(error)
        sys.exit(1)

    ids = re.findall('[0-9]{10,13}', proc)
    texts = re.findall('<text>[\x20-\x7E]+</text>', proc)

    #check that the number of tweets ='s the number of ids
    # strip <text> fields from front & read
    if len(ids) == len(texts):
        final_texts = [i.rstrip('</text>').lstrip('<text>') for i in texts]

        return (ids, final_texts)

    else:
        return (0, 0)

Fixed:

import urllib
import xml.etree.ElementTree

def get_tweets():
    ids = []
    texts = []
    try:
        xmltree = xml.etree.ElementTree.parse(
            urllib.urlopen(
                'http://twitter.com/statuses/public_timeline.xml'
            ))
    except:
        return (0, 0)

    for x in xmltree.getiterator("status"):
        ids.append(x.find("id").text)
        texts.append(x.find("text").text)

    return (ids, texts)

For starters, I ditched the shell-out to curl and fetched the data from Twitter using the urllib library. Since I was grabbing XML from Twitter, the output of urllib.urlopen could be handed straight to xml.etree.ElementTree.parse. And with all the data in an ElementTree object, I could very easily pull out both the tweet text and the Twitter ID number.
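Just to close the loop, here is a minimal usage sketch of my own (not from the original post) showing the fixed function in action:

ids, texts = get_tweets()

# get_tweets() returns (0, 0) when the fetch or parse fails
if ids == 0:
    print("couldn't fetch or parse the public timeline")
else:
    for tweet_id, text in zip(ids, texts):
        print("%s: %s" % (tweet_id, text))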

I can't stress enough how much cleaner the fixed version is to read. Part of that cleanliness comes from using the standard library instead of hacking something together. As an added bonus, because it relies only on Python's built-in libraries, the code can now run on multiple platforms.

So there you have it, Internets. Once again I have wronged you by making a mistake and have only now gotten around to understanding how horrible a mistake I made. Can you forgive me? Thank you.

For the super observant among you: you might notice that I also fixed a bug from the original version of get_tweets and from the version in the last post. Happy hunting.

Tweet Dump: Resurrection

A long time ago I had a small series of blog entries about using Python and MySQL to capture the Twitter public timeline. As I hinted in my CouchDB post last year, I wanted to bring this topic back from the grave, this time using CouchDB instead of MySQL. After a lot of reading and testing, I can now share the fruits of this labor.

  1. import subprocess
  2. import re
  3. import sys
  4. import couchdb
  5. from couchdb.mapping import TextField, ListField
  6.
  7. class Tweet(couchdb.Document):
  8.     _id = TextField()
  9.     _rev = TextField()
  10.     tweet_data = ListField(TextField())
  11.
  12. def get_tweets():
  13.     final_texts = []
  14.     command = """curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' """
  15.     process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
  16.     (proc, error) = process.communicate()
  17.     sys.stdout.flush()
  18.
  19.     if error:
  20.         print(error)
  21.         sys.exit(1)
  22.
  23.     ids = re.findall('[0-9]{10,13}', proc)
  24.     texts = re.findall('<text>[\x20-\x7E]+</text>', proc)
  25.
  26.     #check that the number of tweets ='s the number of ids
  27.     # strip <text> fields from front & read
  28.     if len(ids) == len(texts):
  29.         for i in texts:
  30.             final_texts.append(i.rstrip('</text>').lstrip('<text>'))
  31.
  32.         return (ids, final_texts)
  33.
  34.     else:
  35.         return (0, 0)
  36.
  37. if __name__ == "__main__":
  38.     #using local couchdb server for now
  39.     server = couchdb.client.Server()
  40.
  41.     (ids, tweets) = get_tweets()
  42.
  43.     if ids == 0 or tweets == 0:
  44.         print("Mismatch on count between id's and tweets.")
  45.         sys.exit(1)
  46.
  47.     # Test to see if the db exists, create if it doesn't
  48.     try:
  49.         db = server['tweet_dumps']
  50.     except couchdb.http.ResourceNotFound:
  51.         db = server.create('tweet_dumps')
  52.
  53.     for i in range(len(ids)):
  54.         try:
  55.             rev = db[ids[i]].rev
  56.             db_tweets = db[ids[i]].values()[0]
  57.
  58.             # to get rid of duplicate entries, which happen more
  59.             # often than you think.
  60.             if tweets[i] not in db_tweets:
  61.                 db_tweets.append(tweets[i])
  62.
  63.             db.update([Tweet(_id = ids[i], _rev = rev,
  64.                              tweet_data = db_tweets)])
  65.
  66.         except couchdb.http.ResourceNotFound:
  67.             db.save(Tweet(_id = ids[i], tweet_data = [tweets[i]]))

To be frank, this started off as a copy-and-paste project. All the CouchDB code was copied and pasted from the previous CouchDB post, and the tweet-grabbing code was left over from one of the old tweet dump scripts. Obviously some of the original code has changed: the Tweet class is a little different, the database name is different, and one or two other things have changed.

One of the things that really surprised me about doing this project now, as opposed to over a year ago, was the number of duplicates I captured. The last time I did this, I didn't get a single duplicate in the public timeline. Now, in just one 24-hour capture I had one "tweeter" post the same tweet 118 times. That is why there is code in there to avoid appending duplicates (lines 60 and 61). I don't want to see the same tweet 118 times, nor do I want to store it. I know space is cheap, but I don't want to "pay" for keeping 118 copies of the same thing.

I will fully admit at this point that I found those 118 tweets by one person just by doing a little mouse clicking through the CouchDB web interface. I haven't yet figured out the particular map/reduce view needed to find which ID wrote the most tweets. That will more than likely be the next blog post in this series.
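For what it's worth, here is a rough, untested sketch of the kind of query I have in mind, using couchdb-python's temporary-view query method; the map function emits how many tweets each document holds, and the sorting happens on the client side:

import couchdb

server = couchdb.client.Server()
db = server['tweet_dumps']

# Emit one row per document: (document id, number of stored tweets).
# This assumes tweet_data is the list of tweet texts used above.
map_fun = '''function(doc) {
    if (doc.tweet_data) {
        emit(doc._id, doc.tweet_data.length);
    }
}'''

rows = list(db.query(map_fun))
rows.sort(key=lambda row: row.value, reverse=True)
for row in rows[:10]:
    print("%s %s" % (row.key, row.value))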

After some time spent reviewing the capture results, I decided to modify the code a little, this time including a timestamp for each tweet captured (only the differences are pasted below):

class Tweet(couchdb.Document):
    _id = TextField()
    _rev = TextField()
    tweet_data = DictField(Mapping.build(
        datetime = DateTimeField(),
        text = TextField()
    ))

...

for i in range(len(ids)):
    try:
        rev = db[ids[i]].rev
        db_tweets_dict = db[ids[i]].values()[0]
        db_tweets_dict[str(datetime.datetime.now())] = tweets[i]
        db.update([Tweet(_id = ids[i], _rev = rev,
                         tweet_data = db_tweets_dict)])

    except couchdb.http.ResourceNotFound:
        db.save(Tweet(_id = ids[i], tweet_data = {
            str(datetime.datetime.now()): tweets[i]}))

As you can see, there are some subtle differences between the two scripts. One important difference is that the shell-out command was changed: I used an extra grep to reduce the amount of data that Python has to process. I did this to cut down on the id/tweet count mismatches I was getting; it seemed to work, so I stuck with it. The most important difference is inside the Tweet class, which changed from a list of TextFields to a DictField housing a DateTimeField and a TextField. The other serious difference is the code that updates the tweet_data variable, since updating a list data type is different from updating a dictionary data type. Otherwise these two scripts are exactly the same.
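One thing the diff doesn't show is the imports; to use the new field types and the timestamp, the top of the script would also need to change along these lines (my addition, not part of the original diff):

import datetime
from couchdb.mapping import TextField, DictField, DateTimeField, Mapping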

This does lead me to question how Twitter views, perceives, or deals with its public timeline. I also wonder how accurate a portrayal the public timeline is of Twitter usage. If the public timeline is not an accurate portrayal of Twitter usage, then what is the point? But if it is, then maybe people aren't using the service as much as Twitter wants people to think they are.


Tame the Snake to Sit on the Couch

A while ago I came across an article that talked about using Python with CouchDB. Since I'm interested in learning about the whole NoSQL movement and am trying to understand where this new style of database server fits in the schema (sorry, I couldn't resist) of things, I decided to spend a little time getting to know CouchDB better.

The system that I built this script on, tested it with, and designed it for is Ubuntu (10.10, if anyone cares to know). Although I installed CouchDB through apt-get, I decided to get couchdb-python through easy_install. I did this because the website documentation for couchdb-python is written for the most recent version, and the apt-get version is a little out of date.

So after you've installed CouchDB and couchdb-python, you can run the script below:

  1. #!/usr/bin/python
  2. '''
  3. Simple script to start playing with the couchdb python package.
  4. '''
  5.
  6. import os
  7. import subprocess
  8. import re
  9. import sys
  10. import couchdb
  11. from couchdb.mapping import TextField, ListField, DictField
  12.
  13. class System(couchdb.Document):
  14.     _id = TextField()
  15.     _rev = TextField()
  16.     uname = ListField(TextField())
  17.     packages = DictField()
  18.
  19. def get_packages():
  20.     '''
  21.     gather all of the packages on the system, and return a dictionary with
  22.     the package name as the key, and the version as the value.
  23.     '''
  24.     process = subprocess.Popen("dpkg-query -W -f='${Package} ${Version}\n'", stdout=subprocess.PIPE, shell=True)
  25.     (proc, error) = process.communicate()
  26.     sys.stdout.flush()
  27.
  28.     if error:
  29.         print(error)
  30.         sys.exit(1)
  31.
  32.     proc_dict = {}
  33.     for x in proc.splitlines():
  34.         m = re.search('(?P<package>\S+)\W(?P<version>\S+)', x)
  35.         proc_dict[m.group('package')] = m.group('version')
  36.
  37.     return proc_dict
  38.
  39. def uname_list():
  40.     '''
  41.     Generate a list based on the data from uname, excluding the
  42.     system's name.
  43.     '''
  44.     l = []
  45.     l.append(os.uname()[0])
  46.     for x in os.uname()[2:]:
  47.         l.append(x)
  48.
  49.     return l
  50.
  51. if __name__ == "__main__":
  52.     box_name = os.uname()[1]
  53.     # When not using any arguments, defaults to localhost
  54.     server = couchdb.client.Server()
  55.
  56.     # Test to see if the db exists, create if it doesn't
  57.     try:
  58.         db = server['sys_info']
  59.     except couchdb.http.ResourceNotFound:
  60.         db = server.create('sys_info')
  61.
  62.     # test to see if the computer already exists in the db
  63.     try:
  64.         rev = db[box_name].rev
  65.         db.update([System(_id = box_name, _rev = rev,
  66.                           uname = uname_list(), packages = get_packages())])
  67.     except couchdb.http.ResourceNotFound:
  68.         db.save(System(_id = box_name, uname = uname_list(),
  69.                        packages = get_packages()))

If the script ran without errors, you should be able to open the sys_info database in CouchDB's web interface and see your computer's name along with a rev value of 1-"some string of characters". If you click on your computer's name, you'll see all the information the script inserted into the database: the "_id" field, which contains the computer's name; the "_rev" field, which holds the current revision number for this document; a list of all the packages installed on the system... in no particular order; and finally the output of uname, minus the computer name.
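If you'd rather check from Python than click around, a rough sketch like this (my own, reusing the same couchdb-python calls as the script) should print the stored document back out:

import os
import couchdb

server = couchdb.client.Server()   # same default as the script: local CouchDB
db = server['sys_info']
doc = db[os.uname()[1]]            # the document keyed by this machine's name

print(doc['_rev'])
print(doc['uname'])
for package, version in sorted(doc['packages'].items()):
    print("%s %s" % (package, version))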
If the script did return errors, please make sure that the module initialization arguments are correct (line 54). Couchdb-python uses the defaults if there are no arguments, so in this case the module connects to localhost using the admin account, which has no password. Yes, I know this is really unsafe, but I'm just playing with things right now.
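For example, pointing couchdb-python at a non-default server is just a matter of passing its URL to Server(); the hostname here is a placeholder of mine:

# connect to a remote CouchDB instead of the localhost default
server = couchdb.client.Server('http://couch.example.com:5984/')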

One thing worth pointing out about the System class: in order to create a ListField for CouchDB, you must specify what the ListField will contain. In this case (line 16) I am filling the ListField with TextFields; or, in Python speak, filling the list with strings.
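To illustrate the same rule with a different container, here is a small example of my own (not from the post) where the ListField is told to hold DictFields instead:

from couchdb.mapping import Document, ListField, DictField, Mapping, TextField

class Example(Document):
    # a list whose elements are little dictionaries with two text fields
    entries = ListField(DictField(Mapping.build(
        name = TextField(),
        value = TextField(),
    )))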

I'm surprisingly fascinated with CouchDB... though I'm really hard pressed to say why. At this moment I want to modify my tweet_dump project to use CouchDB. So expect to see that series continued in a little while.

One last thing, for those of you who celebrate it, Happy Thanksgiving!

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.

