Twitter

Tweet Dump: Resurrection part 2

I don't know about you, but when I hear or read the same thing three or so times from random sources, I pay attention. The pattern in these comments has been about one thing: in the previous post in this thread I shelled out to curl to fetch data from Twitter. Not only that, but I used regular expressions to parse XML. I won't deny it... I made some pretty bad mistakes. The only consolation is that I made them over a year and a half ago, when I was just starting to learn Python and wasn't aware of just how many libraries the standard install includes.

To help with the learning process, I'm going to show the original version as well as the “fixed” version.

Original:

    def get_tweets():
        final_texts = []
        command = "curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' "
        process = subprocess.Popen(command, stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE, shell=True)
        (proc, error) = process.communicate()
        sys.stdout.flush()

        if error:
            print(error)
            sys.exit(1)

        ids = re.findall('[0-9]{10,13}', proc)
        texts = re.findall('<text>[\x20-\x7E]+</text>', proc)

        # check that the number of tweets equals the number of ids,
        # then strip the <text> tags from the front and back
        if len(ids) == len(texts):
            final_texts = [i.rstrip('</text>').lstrip('<text>') for i in texts]
            return (ids, final_texts)
        else:
            return (0, 0)

Fixed:

    import urllib
    import xml.etree.ElementTree

    def get_tweets():
        ids = []
        texts = []
        try:
            xmltree = xml.etree.ElementTree.parse(
                urllib.urlopen(
                    'http://twitter.com/statuses/public_timeline.xml'
                ))
        except:
            return (0, 0)

        for x in xmltree.getiterator("status"):
            ids.append(x.find("id").text)
            texts.append(x.find("text").text)

        return (ids, texts)

For starters, I ditched the shell out to curl and grabbed the data from Twitter using the urllib library. Since the response is XML, the file-like object returned by urllib.urlopen can be handed straight to xml.etree.ElementTree.parse. And with all the data in an XML tree, it is easy to pull out both the tweet text and the Twitter ID number.
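To make the parsing step a little more concrete, here is a small sketch of the kind of XML the old public timeline returned. The sample below is made up and trimmed down to just the two fields the script cares about; the real feed carried many more fields per status:

    import xml.etree.ElementTree

    # made-up, trimmed-down sample of the public timeline feed
    sample = """
    <statuses>
      <status>
        <id>1234567890123</id>
        <text>An example tweet</text>
      </status>
    </statuses>
    """

    root = xml.etree.ElementTree.fromstring(sample)
    for status in root.iter("status"):   # getiterator() on older Pythons
        print(status.find("id").text + " " + status.find("text").text)

Each status element carries an id and a text child, which is exactly what the find() calls in get_tweets pull out.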

I don't think I can stress enough how much cleaner the fixed version is to read. Part of that cleanliness comes from using the built-in libraries instead of trying to hack something together. As an added bonus, because it only uses the standard library, the code can now run on multiple platforms.

So there you have it, Internets. Once again I have wronged you by making a mistake and have only now gotten around to understanding how horrible of a mistake I made. Can you forgive me? Thank you.

The super observant among you might notice that I also fixed a bug that was in both the original version of get_tweets and the version from the last post. Happy hunting.

Tweet Dump: Resurrection

A long time ago I wrote a small series of blog entries about using Python and MySQL to capture the Twitter public timeline. As I hinted in my CouchDB post last year, I wanted to bring this topic back from the grave, this time using CouchDB instead of MySQL. After a lot of reading and testing, I can now share the fruits of that labor.

    import subprocess
    import re
    import sys
    import couchdb
    from couchdb.mapping import TextField, ListField

    class Tweet(couchdb.Document):
        _id = TextField()
        _rev = TextField()
        tweet_data = ListField(TextField())

    def get_tweets():
        final_texts = []
        command = """curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' """
        process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
        (proc, error) = process.communicate()
        sys.stdout.flush()

        if error:
            print(error)
            sys.exit(1)

        ids = re.findall('[0-9]{10,13}', proc)
        texts = re.findall('<text>[\x20-\x7E]+</text>', proc)

        # check that the number of tweets equals the number of ids,
        # then strip the <text> tags from the front and back
        if len(ids) == len(texts):
            for i in texts:
                final_texts.append(i.rstrip('</text>').lstrip('<text>'))

            return (ids, final_texts)
        else:
            return (0, 0)

    if __name__ == "__main__":
        # using the local couchdb server for now
        server = couchdb.client.Server()

        (ids, tweets) = get_tweets()

        if ids == 0 or tweets == 0:
            print("Mismatch on count between id's and tweets.")
            sys.exit(1)

        # test to see if the db exists, create it if it doesn't
        try:
            db = server['tweet_dumps']
        except couchdb.http.ResourceNotFound:
            db = server.create('tweet_dumps')

        for i in range(len(ids)):
            try:
                rev = db[ids[i]].rev
                db_tweets = db[ids[i]].values()[0]

                # to get rid of duplicate entries, which happen more
                # often than you might think
                if tweets[i] not in db_tweets:
                    db_tweets.append(tweets[i])

                db.update([Tweet(_id=ids[i], _rev=rev,
                                 tweet_data=db_tweets)])

            except couchdb.http.ResourceNotFound:
                db.save(Tweet(_id=ids[i], tweet_data=[tweets[i]]))

To be frank, this started off as a copy-and-paste project. All the CouchDB code was copied from the previous CouchDB post, and the tweet-grabbing code was left over from one of the old tweet dump scripts. Obviously some of the original code has changed: the Tweet class is a little different, the database name is different, and one or two other things have changed.

One of the things that really surprised me about doing this project now, as opposed to over a year ago, was the number of duplicates I captured. The last time I did this, I didn't get a single duplicate in the public timeline. Now, in just one 24-hour capture, I had one “tweeter” post the same tweet 118 times. That is why the script checks whether a tweet is already in db_tweets before appending it. I don't want to see the same tweet 118 times, nor do I want to store it. I know space is cheap, but I don't want to “pay” for keeping 118 copies of the same thing.

I will fully admit that I found those 118 tweets from one person just by clicking around the CouchDB web interface. I haven't yet figured out how to write the reduce function needed to find which ID posted the most tweets. That will more than likely be the next blog post in this series.
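In the meantime, here is a rough, untested first guess at what that query might look like, assuming the list-based Tweet documents created above and using couchdb-python's db.query() support for temporary map/reduce views (none of this is from the script above):

    import couchdb

    server = couchdb.client.Server()
    db = server['tweet_dumps']

    # map: one row per document, keyed by the stored Twitter id and
    # valued by how many tweets that document has accumulated
    map_fun = """
    function(doc) {
        if (doc.tweet_data) {
            emit(doc._id, doc.tweet_data.length);
        }
    }
    """
    # reduce: sum() is one of CouchDB's built-in JavaScript helpers
    reduce_fun = "function(keys, values, rereduce) { return sum(values); }"

    # group=True keeps one reduced row per key (i.e. per Twitter id)
    rows = db.query(map_fun, reduce_fun, language='javascript', group=True)
    busiest = max(rows, key=lambda row: row.value)
    print(busiest.key + " has the most stored tweets: " + str(busiest.value))

Since every key here is unique, the reduce is mostly a formality; the real work is just picking the largest value on the client side.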

After some time reviewing the results of my capturing, I decided to modify the code a little, this time including a timestamp for each tweet captured (only the differences are pasted below):

    # new imports needed for this version
    import datetime
    from couchdb.mapping import TextField, DictField, DateTimeField, Mapping

    class Tweet(couchdb.Document):
        _id = TextField()
        _rev = TextField()
        tweet_data = DictField(Mapping.build(
            datetime = DateTimeField(),
            text = TextField()
        ))

    ...

    for i in range(len(ids)):
        try:
            rev = db[ids[i]].rev
            db_tweets_dict = db[ids[i]].values()[0]
            db_tweets_dict[str(datetime.datetime.now())] = tweets[i]
            db.update([Tweet(_id=ids[i], _rev=rev,
                             tweet_data=db_tweets_dict)])

        except couchdb.http.ResourceNotFound:
            db.save(Tweet(_id=ids[i], tweet_data={
                str(datetime.datetime.now()): tweets[i]}))

As you can see, there are some subtle differences between the two scripts. One difference is that the shell-out command was changed: I added an extra grep to cut down the data that Python has to process, which reduced the number of id and tweet count mismatches I was getting. This seemed to work, so I stuck with it. The most important difference is inside the Tweet class, which changed from a list of TextFields to a DictField housing a DateTimeField and a TextField. The other serious difference is the code that updates the tweet_data variable, since updating a list is different from updating a dictionary. Otherwise the two scripts are exactly the same.

This does lead me to question how Twitter views and manages its public timeline. I also wonder how accurately the public timeline reflects actual Twitter usage. If it is not an accurate portrayal, then what is the point? But if it is, then maybe people aren't using the service as much as Twitter wants us to think they are.


Interesting use of Twitter


I've been a Twitter user for over a year now and I still don't really understand it. Twitter seems like a great way to IM a group of people, and that is about all. I admit that I don't really get into it, and only once did I receive an answer to a question through Twitter. Otherwise it mostly seems to be about self-aggrandizement. I'm not saying that is bad, just making an observation.

That opinion changed a little recently. Last week I received an email from a job recruiter. (If any of my co-workers who actually read this want to know: I turned the opportunity down.) However, something out of the ordinary caught my eye:

“For up to date Vivo job openings and other 'hopefully helpful' info
Please follow me at www.twitter.com/johnzinkvivo”

I followed the link and saw that what John is doing is posting quick blurbs of job titles on Twitter, in addition to his other random tweets. This struck me as a great way to use the medium. In fact, I think the idea is such a good one that, with proper use of hashtags and enough recruiters doing it, we could see an evolution in the online job hunting arena. There would be less need to rely strictly on Dice, Monster, Cyber Coders, and, dare I say it, even Craigslist, when you could just post on Twitter. It's basically the same idea: with Dice and the rest, you post a job and wait for the responses to come in; with Twitter, you post a tweet with a link to the job on your site, and wait.

It's not a perfect idea, and it would probably need some form of filtering to make it worthwhile. But I do think that in time this evolution will happen. It makes sense to me. In the meantime, I am just happy to have found a “good” use of Twitter for myself.

Tweet Dump part 3

Welcome back to another, and probably the last, installment of the tweet dump project.

The old tweet dump code ran against a local database. I wanted to see how things would change if I pointed it at a remote server, just to get an idea of how the network round trips would affect latency.

So I changed things up a bit: I created a SQL database through my hosting provider and imported the schema into the remote database. I also modified the tweet dump code to use this new remote database. After running the Python script five times, here are my numbers:

Run 1:    43.998 seconds
Run 2:    46.352 seconds
Run 3:    45.029 seconds
Run 4:    55.024 seconds
Run 5:    49.174 seconds
Average:  47.92 seconds
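The numbers above are wall-clock seconds. A minimal way to collect figures like these would be something along the lines of the sketch below (untested, with "tweetdump.py" standing in for the actual script name):

    import subprocess
    import time

    # run the capture script five times and print how long each
    # full fetch-and-insert pass takes
    for run in range(5):
        start = time.time()
        subprocess.call(["python", "tweetdump.py"])
        print("Run %d: %.3f seconds" % (run + 1, time.time() - start))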

In my tweet dump code there is a lot of going back and forth between the client and the server:

    def getId(in_id):
        sql = "select id from ids where twit_id = '" + in_id + "'"
        return runQuery(sql)

    def addId(in_id):
        sql = "insert into ids values (null, '" + in_id + "')"
        runQuery(sql)

    def addTweet(in_id, in_tweet):
        sql = "insert into tweets values ('" + in_id + "',\"" + in_tweet + "\")"
        runQuery(sql)

    ...

    site_id = getId(ids[j])
    if not site_id:
        addId(ids[j])
        site_id = getId(ids[j])
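The runQuery helper isn't shown here, but a minimal sketch of what it could look like with MySQLdb pointed at the remote database would be something like this (all connection details are placeholders, and a real version would probably reuse a single connection instead of opening a new one per query):

    import MySQLdb

    def runQuery(sql):
        # placeholder connection details for the remote host
        conn = MySQLdb.connect(host="db.example.com", user="tweetdump",
                               passwd="secret", db="tweetdump")
        try:
            cursor = conn.cursor()
            cursor.execute(sql)
            results = cursor.fetchall()
            conn.commit()
            return results
        finally:
            conn.close()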

Looking at how getId and addId are used at the bottom of that snippet, you can see that a new ID costs three separate SQL calls to the server: a lookup, an insert, and another lookup. While that might be fine for a local database, it is not good for a remote one. So I decided to try my hand at writing a SQL function to reduce some of the back and forth between the two.

    delimiter //

    DROP FUNCTION IF EXISTS `get_set_ids`//
    CREATE FUNCTION get_set_ids( in_tweet_id BIGINT )
    RETURNS INT(10) UNSIGNED
    BEGIN
        DECLARE new_id INT(10) UNSIGNED;

        SELECT id INTO new_id FROM ids WHERE twit_id = in_tweet_id;

        IF ISNULL(new_id) THEN
            INSERT INTO ids VALUES (NULL, in_tweet_id);
            SELECT id INTO new_id FROM ids WHERE twit_id = in_tweet_id;
        END IF;

        RETURN new_id;
    END //

    delimiter ;

The function above takes the guesswork away from the client and keeps it within the server. By doing this we avoid an entire round of communication between client and server, so in theory we should see at least some kind of speed-up.

With the function created, I modified the script to take advantage of it:

    def get_set_id(in_id):
        sql = "select get_set_ids(" + in_id + ")"
        return runQuery(sql)

    ...

    site_id = get_set_id(ids[j])
    addTweet(str(site_id[0][0]), final_texts[j])

After a bit of testing to make sure things worked, I ran five timed tests (on the same hardware and from the same location, to try and reduce any variables that might crop up).

Run 1:    25.948 seconds
Run 2:    24.35 seconds
Run 3:    26.181 seconds
Run 4:    24.667 seconds
Run 5:    25.352 seconds
Average:  25.3 seconds

The difference between the two averages is about 22 seconds. To be honest, I did not expect these changes to cut my times roughly in half; that was a bit of a shock. I guess in the end this is an example of how much design can really matter.
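Spelling that out with the averages above:

    old_avg = 47.92   # average with the three-query approach (seconds)
    new_avg = 25.30   # average with the get_set_ids() function (seconds)

    print("%.2f seconds saved per run" % (old_avg - new_avg))      # 22.62
    print("%.0f%% of the original time" % (100 * new_avg / old_avg))  # ~53%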

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.

