The Devil is in the Details

This story happened a while ago, but because I've been so busy I haven't really had the time to put it on (electronic) paper. While hanging out on Twitter I saw a message asking why one version of some Haskell code was preferred over another. So I followed the link and saw the code below:

1) From http://learnyouahaskell.com/recursion#a-few-more-recursive-functions

    take' :: (Num i, Ord i) => i -> [a] -> [a]
    take' n _
        | n <= 0 = []
    take' _ [] = []
    take' n (x:xs) = x : take' (n-1) xs

vs.

2) Just playin

    take' :: (Num i, Ord i) => i -> [a] -> [a]
    take' 0 _ = []
    take' _ [] = []
    take' n (x:xs) = x : take' (n-1) xs

Freeformz, the author of the code, was trying to do some code reduction. And while his version does look a little nicer, the lack of error checking makes the function a little incomplete and problematic. I noticed this right away, but it took a little while to get the idea across over Twitter; it's a bit difficult to relay complex concepts in 140 characters or less. Ultimately it was the tweet I sent him, “@freeformz right. Let me ask you this question, in #2 what happens it I input -2 for n?”, and his reply, “@bryceverdier it fails. duh on me. ;-)”, that turned the light on for him. (With a negative n the 0 pattern never matches, so the function walks the entire list, or never finishes on an infinite one, instead of returning [] the way the guarded version does.)

This was my first time troubleshooting someone else's Haskell code. I'm really grateful that it was an easy one, because if it had been any more complicated it probably would have been beyond my current Haskell programming abilities, something I intend to work on now that I have purchased my own copy of Real World Haskell. YAY!

Tweet Dump: Resurrection part 2

I don't know about you, but when I hear or read the same thing three or so times from random sources, I pay attention. And the pattern in these comments has been about one thing: in the previous post in this thread I shelled out to curl to get data from Twitter. Not only that, but I used regular expressions to parse the XML. I won't deny it... I made some pretty bad mistakes. The only consolation I have is that I made them over a year and a half ago, when I was just starting to learn Python and wasn't aware of just how many libraries the standard install includes.

To help with the learning process, I'm going to show the original version as well as the “fixed” version.

Original:

    import subprocess
    import re
    import sys

    def get_tweets():
        final_texts = []
        command = "curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' "
        process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
        (proc, error) = process.communicate()
        sys.stdout.flush()

        if error:
            print(error)
            sys.exit(1)

        ids = re.findall('[0-9]{10,13}', proc)
        texts = re.findall('<text>[\x20-\x7E]+</text>', proc)

        # check that the number of tweets equals the number of ids,
        # then strip the <text> tags from front and rear
        if len(ids) == len(texts):
            final_texts = [i.rstrip('</text>').lstrip('<text>') for i in texts]
            return (ids, final_texts)
        else:
            return (0, 0)

Fixed:

    import urllib
    import xml.etree.ElementTree

    def get_tweets():
        ids = []
        texts = []
        try:
            xmltree = xml.etree.ElementTree.parse(
                urllib.urlopen(
                    'http://twitter.com/statuses/public_timeline.xml'
                ))
        except:
            return (0, 0)

        for x in xmltree.getiterator("status"):
            ids.append(x.find("id").text)
            texts.append(x.find("text").text)

        return (ids, texts)

For starters, I ditched the shell out to curl and got the data from Twitter using the urllib library. Since I was grabbing XML from Twitter, the output of urllib.urlopen could be handed straight to xml.etree.ElementTree.parse. And with all the data in an XML tree, pulling out both the tweet text and the Twitter ID number is easy.
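
For what it's worth, here's roughly how I'd drive the fixed function; the surrounding script isn't shown above, so treat this as an illustrative sketch rather than the exact code I run:

    if __name__ == "__main__":
        ids, texts = get_tweets()
        if ids == 0:
            # get_tweets() returns (0, 0) when fetching or parsing fails
            print("Couldn't fetch or parse the public timeline.")
        else:
            for tweet_id, text in zip(ids, texts):
                print("%s: %s" % (tweet_id, text))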

I can't stress enough how much cleaner the fixed version is to read. Part of that cleanliness comes from using the built-in libraries instead of trying to hack something together. As an added bonus, because the code only uses Python's standard library, it can now run on multiple platforms.

So there you have it, Internets. Once again I have wronged you by making a mistake and have only now gotten around to understanding how horrible of a mistake I made. Can you forgive me? Thank you.

For the super observant among you: you might also notice that I fixed a bug between the original version of get_tweets and the version from the last thread. Happy hunting.

Programming Praxis – Two Kaprekar Exercises

Sorry I haven't written in a while; I've been rather busy with life recently. I'm still planning on continuing my Tweet Dump series, and I will post that soon. One of the reasons for the delay is that learning the Colemak keyboard layout has slowed my typing down quite a lot this week.

Anyway, yesterday's Programming Praxis exercise reads:
For today’s exercise we return to the world of recreational mathematics with two exercises due to the Indian mathematician Dattaraya Ramchandra Kaprekar. First we compute Kaprekar chains:

1. Choose any four-digit number, with at least two different digits. Leading zeros are permitted.

2. Arrange the digits into two numbers, one with the digits sorted into ascending order, the other with the digits sorted into descending order.

3. Subtract the smaller number from the larger number.

4. Repeat until the number is 6174. At that point, the process will cycle with 7641 − 1467 = 6174.

For instance, starting with 2011, the chain is 2110 − 112 = 1998, 9981 − 1899 = 8082, 8820 − 288 = 8532, and 8532 − 2358 = 6174.

The second exercise determines if a number is a Kaprekar number, defined as an n-digit number such that, when it is squared, the sum of the first n or n−1 digits and the last n digits is the original number. For instance, 703 is a Kaprekar number because 703² = 494209 and 494 + 209 = 703.

So here is the code I wrote and submitted to the comments section. I will happily admit (like I did in my comment) that my isKaprekar function is a modified version of one I saw in the comments here; it was cleaner than my first attempt and I wanted to try out the "int(s[:-sz] or 0)" expression.

    #!/usr/bin/python3

    import itertools

    def isKaprekar(number):
        square = str(number ** 2)
        numlen = len(str(number))
        return number == int(square[:-numlen] or 0) + int(square[-numlen:])

    def keprekar_chain(number):
        retlist = [number]
        # the starting number needs at least two different digits;
        # pad to four digits so leading zeros are kept
        if len(set(str(number).zfill(4))) >= 2:
            while retlist[-1] != 6174:
                pers = [int(''.join(x)) for x in
                        itertools.permutations(str(retlist[-1]).zfill(4))]
                retlist.append(max(pers) - min(pers))
            return retlist
        else:
            return []


    if __name__ == "__main__":
        print('Kaprekar numbers from 1 to 1000:')
        print(*[x for x in range(1, 1001) if isKaprekar(x)])

        print('Longest chain between 1000 and 9999')
        kep_list = []
        for x in range(1000, 10000):
            tlist = keprekar_chain(x)
            kep_list.append((len(tlist), tlist))

        print(sorted(kep_list, key=lambda x: x[0], reverse=True)[0])
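
As an aside, the exercise describes each step in terms of sorting the digits, while my version gets the same largest/smallest pair by generating every permutation. A single step written the way the exercise describes might look like this (just a sketch for comparison, not part of what I submitted):

    def kaprekar_step(n):
        # pad to four digits so leading zeros survive, then sort the digits
        # to build the smallest and largest arrangements
        digits = sorted(str(n).zfill(4))
        ascending = int(''.join(digits))
        descending = int(''.join(reversed(digits)))
        return descending - ascending

For example, kaprekar_step(2011) gives 1998, matching the first step of the chain quoted above, and sorting four digits is quite a bit cheaper than building all 24 permutations.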

That's all for now; more to show up once I can type at normal speeds again.

Tweet Dump: Resurrection

A long time ago I had a small series of blog entries about using Python and MySQL to capture the Twitter public timeline. As I hinted in my CouchDB post last year, I wanted to bring this topic back from the grave, this time using CouchDB instead of MySQL. After a lot of reading and testing, I can now share the fruits of that labor.

    import subprocess
    import re
    import sys
    import couchdb
    from couchdb.mapping import TextField, ListField

    class Tweet(couchdb.Document):
        _id = TextField()
        _rev = TextField()
        tweet_data = ListField(TextField())

    def get_tweets():
        final_texts = []
        command = "curl http://twitter.com/statuses/public_timeline.xml | grep -A 3 '<status>' "
        process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
        (proc, error) = process.communicate()
        sys.stdout.flush()

        if error:
            print(error)
            sys.exit(1)

        ids = re.findall('[0-9]{10,13}', proc)
        texts = re.findall('<text>[\x20-\x7E]+</text>', proc)

        # check that the number of tweets equals the number of ids,
        # then strip the <text> tags from front and rear
        if len(ids) == len(texts):
            for i in texts:
                final_texts.append(i.rstrip('</text>').lstrip('<text>'))
            return (ids, final_texts)
        else:
            return (0, 0)

    if __name__ == "__main__":
        # using the local couchdb server for now
        server = couchdb.client.Server()

        (ids, tweets) = get_tweets()

        if ids == 0 or tweets == 0:
            print("Mismatch on count between id's and tweets.")
            sys.exit(1)

        # Test to see if the db exists, create it if it doesn't
        try:
            db = server['tweet_dumps']
        except couchdb.http.ResourceNotFound:
            db = server.create('tweet_dumps')

        for i in range(len(ids)):
            try:
                rev = db[ids[i]].rev
                db_tweets = db[ids[i]].values()[0]

                # to get rid of duplicate entries, which happen more
                # often than you think
                if tweets[i] not in db_tweets:
                    db_tweets.append(tweets[i])

                db.update([Tweet(_id=ids[i], _rev=rev,
                                 tweet_data=db_tweets)])

            except couchdb.http.ResourceNotFound:
                db.save(Tweet(_id=ids[i], tweet_data=[tweets[i]]))

To be frank, this started off as a copy-and-paste project: all the CouchDB code came from the previous CouchDB post, and the tweet-grabbing code was left over from one of the old tweet dump scripts. Obviously some of the original code has changed: the Tweet class is a little different, the database name is different, and one or two other things have changed.

One of the things that really surprised me about doing this project now, as opposed to over a year ago, was the number of duplicates I captured. The last time I did this, I didn't get a single duplicate in the public timeline. Now, in just one 24-hour capture, I had one “tweeter” tweet the same tweet 118 times. That is why there is code in there to skip duplicates before appending (the "if tweets[i] not in db_tweets" check). I don't want to see the same tweet 118 times, nor do I want to store it. I know space is cheap, but I don't want to “pay” for keeping 118 copies of the same thing.

I will fully admit that I found those 118 tweets from one person just by clicking around the CouchDB web interface. I haven't yet figured out how to write the reduce function needed to find which ID wrote the most tweets; that will more than likely be the next blog post in this series.
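
I haven't worked out the view yet, but here is the rough shape of what I have in mind, assuming couchdb-python's ad-hoc query() helper and the list-based schema above. It's an untested sketch, and it actually sidesteps the reduce step entirely, since each document already holds every tweet for one ID:

    # Untested sketch: run a temporary (ad-hoc) view through couchdb-python.
    # Each document maps one Twitter ID to the list of tweets captured for it,
    # so the map function just emits how many tweets each document holds.
    map_fun = '''
    function(doc) {
        if (doc.tweet_data) {
            emit(doc._id, doc.tweet_data.length);
        }
    }
    '''

    busiest = max(db.query(map_fun), key=lambda row: row.value)
    print(busiest.key, busiest.value)

A saved design document would be the better long-term home for this, but a temporary view is enough to poke at the data.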

After spending some time reviewing the capture results, I decided to modify the code a little, this time including a time stamp for each captured tweet (only the differences are pasted below):

    class Tweet(couchdb.Document):
        _id = TextField()
        _rev = TextField()
        tweet_data = DictField(Mapping.build(
            datetime=DateTimeField(),
            text=TextField()
        ))

    ...

    for i in range(len(ids)):
        try:
            rev = db[ids[i]].rev
            db_tweets_dict = db[ids[i]].values()[0]
            db_tweets_dict[str(datetime.datetime.now())] = tweets[i]
            db.update([Tweet(_id=ids[i], _rev=rev,
                             tweet_data=db_tweets_dict)])

        except couchdb.http.ResourceNotFound:
            db.save(Tweet(_id=ids[i], tweet_data={
                str(datetime.datetime.now()): tweets[i]}))

As you can see, there are some subtle differences between the two scripts. One difference is that the shell-out command was changed: I added an extra grep to reduce the amount of data Python has to process, which cut down on a lot of the id and tweet count mismatches I was getting. It seemed to work, so I stuck with it. The most important difference is inside the Tweet class, which changed from a list of TextFields to a DictField housing a DateTimeField and a TextField. The other notable difference is the code that updates the tweet_data variable, since updating a list requires different code than updating a dictionary. Otherwise the two scripts are exactly the same.

This does lead me to question how Twitter views, perceives, or deals with its public timeline. I also wonder how accurate a portrayal of Twitter usage the public timeline really is. If it's not an accurate portrayal, then what is the point? But if it is, then maybe people aren't using the service as much as Twitter wants us to think they are.

