Tweet Dump: Resurrection

A long time ago I had a small series of blog entries about using Python and MySQL to capture the Twitter public timeline. As I hinted in my CouchDB post last year, I wanted to bring this topic back from the grave, this time using CouchDB instead of MySQL. After a lot of reading and testing, I can now share the fruits of that labor.

import subprocess
import re
import sys
import couchdb
from couchdb.mapping import TextField, ListField

class Tweet(couchdb.Document):
    _id = TextField()
    _rev = TextField()
    tweet_data = ListField(TextField())

def get_tweets():
    final_texts = []
    command = """curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' """
    process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    (proc, error) = process.communicate()
    sys.stdout.flush()

    if error:
        print(error)
        sys.exit(1)

    ids = re.findall('[0-9]{10,13}', proc)
    texts = re.findall('<text>[\x20-\x7E]+</text>', proc)

    # check that the number of tweets equals the number of ids,
    # then strip the <text> tags from the front and rear
    if len(ids) == len(texts):
        for i in texts:
            final_texts.append(i.rstrip('</text>').lstrip('<text>'))

        return (ids, final_texts)

    else:
        return (0, 0)

if __name__ == "__main__":
    # using the local couchdb server for now
    server = couchdb.client.Server()

    (ids, tweets) = get_tweets()

    if ids == 0 or tweets == 0:
        print("Mismatch on count between ids and tweets.")
        sys.exit(1)

    # test to see if the db exists, create it if it doesn't
    try:
        db = server['tweet_dumps']
    except couchdb.http.ResourceNotFound:
        db = server.create('tweet_dumps')

    for i in range(len(ids)):
        try:
            # the document for this id already exists, so fetch its
            # current revision and stored tweets
            rev = db[ids[i]].rev
            db_tweets = db[ids[i]].values()[0]

            # to get rid of duplicate entries, which happen more
            # often than you think
            if tweets[i] not in db_tweets:
                db_tweets.append(tweets[i])

                db.update([Tweet(_id=ids[i], _rev=rev,
                                 tweet_data=db_tweets)])

        except couchdb.http.ResourceNotFound:
            # first time this id has been seen, so create a new document
            db.save(Tweet(_id=ids[i], tweet_data=[tweets[i]]))

To be frank, this started off as a copy-and-paste project. All of the CouchDB code was copied and pasted from the previous CouchDB post, and the tweet-grabbing code was left over from one of the old tweet dump scripts. Obviously some of the original code has changed: the Tweet class is a little different, the database name is different, and one or two other things have changed.

One of the things that really surprised me about doing this project now, as opposed to over a year ago, was the number of duplicates I captured. The last time I did this, I didn't get a single duplicate in the public timeline. Now, in just one 24-hour capture, I had one “tweeter” tweet the same tweet 118 times. That is why there is code in there for not appending duplicates (the check in the update loop that only appends a tweet if it isn't already in the document). I don't want to see the same tweet 118 times, nor do I want to store it. I know space is cheap, but I don't want to “pay” for keeping 118 copies of the same thing.

I will fully admit at this point that I found those 118 tweets by one person just by doing a little mouse clicking through the CouchDB web interface. I haven't yet figured out how to write the map/reduce view that would find which ID wrote the most tweets. That will more than likely be the next blog post in this series.
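In the meantime, something along these lines should work. This is only an untested sketch, not code from this project: it assumes the list-based tweet_data schema from the script above, reuses the db handle from that script, and runs a temporary map/reduce view through couchdb-python's db.query().

# Untested sketch: count the tweets stored per document with a temporary
# map/reduce view, then pick the document with the largest count.
map_fun = """
function(doc) {
  if (doc.tweet_data) {
    emit(doc._id, doc.tweet_data.length);
  }
}
"""
reduce_fun = "function(keys, values) { return sum(values); }"

# group=True gives one row per document id; value is that id's tweet count
rows = db.query(map_fun, reduce_fun, group=True)
busiest = max(rows, key=lambda row: row.value)
print(busiest.key, busiest.value)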

After some time spent reviewing the results of my capturing, I decided to modify the code a little, this time including a time stamp for each tweet captured (only the differences are pasted below):

# additional imports needed for this version
import datetime
from couchdb.mapping import TextField, DictField, DateTimeField, Mapping

class Tweet(couchdb.Document):
    _id = TextField()
    _rev = TextField()
    tweet_data = DictField(Mapping.build(
        datetime=DateTimeField(),
        text=TextField()
    ))

...

    for i in range(len(ids)):
        try:
            rev = db[ids[i]].rev
            db_tweets_dict = db[ids[i]].values()[0]

            # key each captured tweet by the time it was captured
            db_tweets_dict[str(datetime.datetime.now())] = tweets[i]

            db.update([Tweet(_id=ids[i], _rev=rev,
                             tweet_data=db_tweets_dict)])

        except couchdb.http.ResourceNotFound:
            db.save(Tweet(_id=ids[i], tweet_data={
                str(datetime.datetime.now()): tweets[i]}))

As you can see, there are some subtle differences between the two scripts. One important difference is that the shell-out command was changed: I used an extra grep to help reduce the data that Python has to process. I did this to cut down on the many ID and tweet count mismatches I was getting, and since it seemed to work, I stuck with it. The most important difference is inside the Tweet class, where tweet_data was changed from a list of TextFields to a DictField that houses a DateTimeField and a TextField. The other serious difference is the code that updates the tweet_data variable, since updating a list requires different code than updating a dictionary. Otherwise, the two scripts are exactly the same.
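To make that last difference concrete, here is roughly what a stored document ends up looking like under each schema (illustrative values only, not real captured data):

# Illustrative documents only; the ids, revisions, and text are made up.
# The first script appends each new tweet to a list, while the second keys
# each tweet by a timestamp string, so the update code has to differ.
list_doc = {
    '_id': '1234567890',
    '_rev': '2-abc',
    'tweet_data': ['first tweet text', 'second tweet text'],
}

dict_doc = {
    '_id': '1234567890',
    '_rev': '2-abc',
    'tweet_data': {
        '2010-05-01 18:00:12.345678': 'first tweet text',
        '2010-05-01 19:41:03.112233': 'second tweet text',
    },
}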

This does lead me to question how Twitter views, perceives, or deals with its public timeline. I also wonder how accurately the public timeline portrays Twitter usage. If the public timeline is not an accurate portrayal of Twitter usage, then what is the point? But if it is, then maybe people aren't using the service as much as Twitter wants people to think they are.


Tweet ID Length

You're looking for at most 13 digits in a tweet ID, aren't you? IDs are currently 17 digits and that number is sure to rise. Could that account for the duplicates? Better to parse the XML properly.
-Bill


Hey Bill,

Thanks for the comment. I originally wrote that regex somewhere around a year ago, and I'll admit I haven't been keeping up with the length of Twitter ID numbers, so the script hasn't been modernized.

Funny you should mention parsing the XML properly: my next post is about exactly that. I've rewritten the script to use urllib and xml.etree.ElementTree instead of its current shell-out, grep, and regex approach. So keep an eye out for that next week.
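The gist of the change looks something like this (a rough, untested sketch, not the final version from that post; Python 2, like the scripts above):

# Rough sketch: fetch the public timeline with urllib and parse it with
# ElementTree instead of shelling out to curl and grep.
import urllib
import xml.etree.ElementTree as ET

def get_tweets():
    response = urllib.urlopen('http://twitter.com/statuses/public_timeline.xml')
    tree = ET.parse(response)
    ids = []
    texts = []
    for status in tree.findall('status'):
        ids.append(status.findtext('id'))
        texts.append(status.findtext('text'))
    return (ids, texts)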

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.
