Tweet Dump: Resurrection part 2

I don't know about you, but when I hear or read the same thing three or so times from random sources, I pay attention. The pattern in these comments has been about one thing: in the previous post in this thread, I shelled out to curl to get data from Twitter. Not only that, but I used regular expressions to parse the XML. I won't deny it... I made some pretty bad mistakes. The only consolation I have is that I made them over a year and a half ago, when I was just starting to learn Python and wasn't aware of just how many libraries the standard install includes.

To help with the learning process, I'm going to show the original version as well as the “fixed” version.

Original:

  1. def get_tweets():
  2.     final_texts = []
  3.     command = "curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' "
  4.     process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
  5.     (proc, error) = process.communicate()
  6.     sys.stdout.flush()
  7.  
  8.     if error:
  9.         print(error)
  10.         sys.exit(1)
  11.  
  12.     ids = re.findall('[0-9]{10,13}', proc)
  13.     texts = re.findall('<text>[\x20-\x7E]+</text>', proc)
  14.  
  15.     # check that the number of tweets equals the number of ids
  16.     # strip <text> tags from front & rear
  17.     if len(ids) == len(texts):
  18.         final_texts = [i.rstrip('</text>').lstrip('<text>') for i in texts]
  19.  
  20.         return (ids, final_texts)
  21.  
  22.     else:
  23.         return (0, 0)

Fixed:

  1. def get_tweets():
  2.     ids = []
  3.     texts = []
  4.     try:
  5.         xmltree = xml.etree.ElementTree.parse(
  6.             urllib.urlopen(
  7.                 'http://twitter.com/statuses/public_timeline.xml'
  8.             ))
  9.     except:
  10.         return (0, 0)
  11.  
  12.     for x in xmltree.getiterator("status"):
  13.         ids.append(x.find("id").text)
  14.         texts.append(x.find("text").text)
  15.  
  16.     return (ids, texts)

For starters, I ditched the shell out to curl and got the data from Twitter using the urllib library. Since I was grabbing XML from Twitter, the output of urllib.urlopen could then very easily be parsed by xml.etree.ElementTree.parse. And since I then had all the data in an XML object, I could very easily pull out both the tweet text and the Twitter ID number.

I don't think I can stress enough how much cleaner the fixed version is to read. Part of that cleanliness comes from using the built-in libraries instead of trying to hack something together. As an added bonus, since the code uses only Python's built-in libraries, it can now run on multiple platforms.

So there you have it, Internets. Once again I have wronged you by making a mistake and have only now gotten around to understanding how horrible of a mistake I made. Can you forgive me? Thank you.

The super observant among you might notice that I also fixed a bug from the original version of get_tweets and the version from the last thread. Happy Hunting.

Other suggestion

The biggest issue with the new code is the bare except: it catches Ctrl-C (KeyboardInterrupt) and the like. At the very least do an "except Exception". Better yet, catch only the exceptions thrown by connection and parsing problems.
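Something along these lines, for example (a rough sketch assuming Python 2.7, where urllib.urlopen raises IOError on connection trouble and ElementTree raises ParseError on malformed XML):

import urllib
import xml.etree.ElementTree as ET

def get_tweets():
    ids = []
    texts = []
    try:
        # Only the errors we actually expect are caught: IOError from the
        # network call and ParseError from bad XML. Anything else,
        # including Ctrl-C, propagates normally.
        xmltree = ET.parse(
            urllib.urlopen('http://twitter.com/statuses/public_timeline.xml'))
    except (IOError, ET.ParseError):
        return (0, 0)

    for x in xmltree.getiterator("status"):
        ids.append(x.find("id").text)
        texts.append(x.find("text").text)

    return (ids, texts)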

The pattern of mylist = [], a for loop, and mylist.append is a prime candidate for a list comprehension:

  1. tweets = []
  2. for x in xmltree.getiterator("status"):
  3.     tweets.append((x.find("id").text, x.find("text").text))
  4. return tweets

to this:
  1. return [(x.find("id").text, x.find("text").text)
  2.         for x in xmltree.getiterator("status")]

But the explicit loop is perfectly fine too, since it is very readable.

Another slight improvement

Hey mate,

Nice improvements. I think that another good thing to do would be to get rid of the tuple-of-arrays approach. Instead of having two arrays that share information via indexes, it would be better to return an array of tuples so that you don't have to worry about indexes at all.

A-la:

  1. tweets = []
  2. .
  3. .
  4. for x in xmltree.getiterator("status"):
  5.     tweets.append((x.find("id").text, x.find("text").text))
  6.  
  7. return tweets

Bear in mind I'm a Python noob, so I'm not sure if this syntax is correct, but hopefully it gives the idea.
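And the calling side would then read something like this (again, just a sketch of the unpacking, not tested):

# Each element of the returned list is an (id, text) pair, so the
# caller can unpack it directly instead of lining up indexes.
for tweet_id, text in get_tweets():
    print(tweet_id + ": " + text)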

Cheers!

How about yield?

At line 12 you might also try:
for x in xmltree.getiterator("status"):
  yield(x.find("id").text, x.find("text").text)

And in the calling code:
tweets = dict(get_tweets())

I think you'll find it even easier to populate your CouchDB out of a Python dictionary.
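Put together, the whole function would look roughly like this (a sketch only; imports are spelled out here and the error handling is left out to keep it short):

import urllib
import xml.etree.ElementTree

def get_tweets():
    # A generator: yields one (id, text) pair per status element
    # instead of building two parallel lists up front.
    xmltree = xml.etree.ElementTree.parse(
        urllib.urlopen('http://twitter.com/statuses/public_timeline.xml'))
    for x in xmltree.getiterator("status"):
        yield (x.find("id").text, x.find("text").text)

# dict() consumes the generator, mapping each tweet id to its text.
tweets = dict(get_tweets())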

Yield!

I had no idea Python had yield. Good suggestion. Laziness for the win.

Thanks for the post Anon. I

Thanks for the post Anon. I think that both you and OJ are right about using a different datatype than my current dual-list setup. At the moment I am more partial towards the dictionary solution.

That being said, I don't see at the moment what the benefit of using yield would be in this case.

Arguably more Pythonic

yield doesn't buy you a whole lot in this case, where the sequence to be returned is small. Nevertheless, it's a good idea to keep it on your radar because you may well have cases where large sequences consume more memory than you'd like. Furthermore, using yield (which turns your function into a generator function, BTW) is considered by many to be more Pythonic, or at least more idiomatic on Python versions in the last, say, five years.
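To make the memory point concrete, here's a toy sketch (nothing Twitter-specific):

import itertools

def numbers(limit):
    # A generator function: values are produced one at a time, on demand,
    # so no large list ever sits in memory.
    i = 0
    while i < limit:
        yield i
        i += 1

# Only the first three values are ever computed; the rest never exist.
print(list(itertools.islice(numbers(1000000), 3)))   # [0, 1, 2]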

Lastly, the spacing in my original post made it look like yield is a function by virtue of the opening paren being adjacent to the token "yield". It's actually a keyword, and you don't actually need the parens at all:

$ python32
Python 3.2 (r32:88445, Apr 12 2011, 09:28:14)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def foo():
...    yield 1,2
...
>>> dict(foo())
{1: 2}
>>>

The example above works identically in Python 2.4.

Have fun!
James

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.
