Apache Log IP Counting

It all started with a simple question: “Which single IP shows up the most in an apache log file?” To which I ultimately came up with an answer using a string of Bash commands:

 less apache_log | cut -d '-' -f 1 | sort -r | uniq -c

This produced some text displaying how many times various IP addresses made requests from an apache web server. After seeing the results and thinking for a moment, I remembered that there is a “new” data type – as of Python 2.7 and new to me at least – called “Counter” within the collections module in the standard library that would allow me to recreate this in Python. So I quickly whipped up the code below:

    #!/usr/bin/env python

    from sys import argv
    from collections import Counter

    if __name__ == "__main__":
        # Count the text before the first "-" on each line (the client IP).
        with open(argv[1]) as f:
            c = Counter(i.partition("-")[0] for i in f)

        # Print the counts right-aligned, most frequent address first.
        for k, v in c.most_common():
            print('{:>4} {}'.format(v, k))

VOILÀ! I get the same results as if I'd done it in Bash. Of course, like the madman I am, I wasn't happy to just stop at Python; I had to write this up in Haskell too. So I said “damn the torpedoes,” fired up my favorite text editor, and started to hack and slash my way toward a third version. After much documentation reading, internet searching, and trial and error I can proudly proclaim, “MISSION ACCOMPLISHED”:

    module Main where

    import System.Environment (getArgs)
    import qualified Data.Text as T
    import qualified Data.Text.IO as TI (readFile)
    import Data.Map hiding (map)

    type Counter = (T.Text, Int)

    mapUnpack :: Counter -> String
    mapUnpack (k,v) = show v ++ " " ++ T.unpack k

    sortMapByValue :: Map T.Text Int -> [Counter]
    sortMapByValue = rqsort . toList

    -- copied & modified from LYAHGG, recursion chapter
    -- doing in reverse order as to better suit the output
    rqsort :: [Counter] -> [Counter]
    rqsort [] = []
    rqsort ((k,v):xs) =
        let smallerSorted = rqsort [(k',v') | (k',v') <- xs, v' <= v]
            biggerSorted = rqsort [(k',v') | (k',v') <- xs, v' > v]
        in  biggerSorted ++ [(k,v)] ++ smallerSorted

    main :: IO ()
    main = do
        results <- getArgs >>= TI.readFile . head
        -- keep the text before the first dash on each line and drop the spaces
        let results' = map (\x -> T.replace space nothing $ head $ T.splitOn dash x) $ T.lines results
        -- pair every IP with 1 and add the 1s together for equal keys
        let results'' = fromListWith (+) . zip results' $ repeat (1 :: Int)
        -- putStrLn writes the String as-is; print would wrap it in quotes
        mapM_ (putStrLn . mapUnpack) $ sortMapByValue results''
      where dash = T.pack "-"
            space = T.pack " "
            nothing = T.pack ""

The Haskell implementation took me longer than I expected as there were a couple of challenges to figure out:
1) Learning the Data.Map datatype and figuring out how to build that datatype from a list of IP addresses.
2) Sorting the Map by value and not by key.
3) Unpacking the Map to be printed.

The first problem was solved by a lucky internet search that showed me an example of how to use the 'fromListWith' function. This function performed the heavy lifting of sorting and counting the IP addresses in the log file.

To sort by value instead of by key, I was able to construct my own solution by tweaking the quicksort example from Learn You a Haskell. The changes I made expanded the tuple, allowing it to compare values and sort them in descending order.

For the text holding and manipulation involved in Map unpacking, I've been told by other Haskellers that Data.Text is “the way to go” (as the default String implementation is a little slow and lacking features). While Data.Text did provide me an easy way to split the string and grab the IP addresses, it also required translating the IP addresses back into a String data type before Haskell would print them. Hence my need for a specific function to create a string from each item in the Map. In the grand scheme of things, having to write the mapUnpack function wasn't horrible...it was just one more hoop that I had to jump through before I could call this project complete. With these pieces in place I was able to put the program together without TOO much hassle, and in the end got the exact same results as both the Bash string and the Python program.
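
For readers more at home in Python, the two steps described above (accumulating counts the way 'fromListWith (+)' does, then ordering by value) can be sketched as follows. This is a rough analogue rather than a translation of the Haskell; count_keys and by_value_desc are names invented for this example, and the addresses are made up:

```python
def count_keys(keys):
    """Pair every key with 1 and add up the 1s for equal keys."""
    totals = {}
    for key in keys:
        totals[key] = totals.get(key, 0) + 1
    return totals


def by_value_desc(totals):
    """Return (key, count) pairs ordered by count, largest first."""
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Made-up addresses, standing in for the first field of each log line.
ips = ["10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.2", "10.0.0.1"]

for ip, hits in by_value_desc(count_keys(ips)):
    print("{:>4} {}".format(hits, ip))
```

The sort key plays the role that the tweaked comparison plays in rqsort: it looks only at the count and ignores the address.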

Would I recommend writing these scripts for work instead of using the string of Bash commands? No, especially if you're asked this question in a high-stress situation. However, these little personal challenges were a great way to expand my programming skills, particularly in learning about new libraries like the collections module in Python and the Data.Map module in Haskell. Having this random new knowledge might not seem worthwhile upfront, but might come in handy in the future if I ever encounter a problem that simple strings of Bash commands can't handle.
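
As a footnote on the collections module: the original question asked for the single busiest IP, and Counter can answer that directly because most_common accepts a count. A minimal sketch, assuming the IP is the text before the first "-" on each line; busiest_ip and the sample log lines are made up for illustration:

```python
from collections import Counter


def busiest_ip(lines):
    """Return (ip, hits) for the most frequent client address, or None."""
    # Take everything before the first "-" and trim the trailing space.
    counts = Counter(line.partition("-")[0].strip() for line in lines)
    top = counts.most_common(1)  # a list holding at most one (ip, hits) pair
    return top[0] if top else None


# Hypothetical lines in Apache's common log format.
sample = [
    '10.0.0.1 - - [01/Jan/2013:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2013:00:00:02 +0000] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2013:00:00:03 +0000] "GET /b HTTP/1.1" 404 64',
]

print(busiest_ip(sample))  # ('10.0.0.1', 2)
```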

Connect 4

A couple of weeks ago I was lucky enough to have my employer send me to PyCon. If you weren't at PyCon you missed out on a lot of things...like the invading squirrel hordes. Thankfully, all of the talks are viewable here. But PyCon is not the focus of this post, just a starting point. At PyCon, in the vendor area, there was the Thumbtack booth. The guys at this booth weren't doing the “normal” conference thing (passing out schwag to every being, living or dead, that passed them); they actually made people work for their schwag. Thumbtack had a programming challenge that you had to submit a solution to before they would give you either a large shot glass or a glass beer stein.
The challenge was to accept a list of lists from standard input, and parse it looking for a winner in a game of Connect 4. I believe this could be a great interview question, and will be using it for the upcoming interviews where I work. One of the reasons for this is that it's deceptively difficult. While we as humans have been conditioned to recognize patterns since birth, computers need to be taught every step from the beginning. How do I go about teaching something that I don't remember learning? But enough of my babbling - let's look at my solution:

    #!/usr/bin/env python

    from __future__ import print_function
    import ast
    import sys

    def cell(row, col):
        # Return the piece at (row, col), or None when the position is off
        # the board. The bounds are checked explicitly because an index of
        # -1 wraps around to the last column instead of raising IndexError.
        if 0 <= row < len(four_list) and 0 <= col < len(four_list[row]):
            return four_list[row][col]
        return None

    def char_check(row, col, char, prev_row=0, prev_col=0, count=0, direction=0):
        # Direction codes: 4 = left, 6 = right, 7/8/9 = next row
        # (diagonal left, straight, diagonal right); 0 = not chosen yet.
        direction_dict = {1: None,
                          2: None,
                          3: None,
                          4: lambda: char_check(row, col - 1, char, row, col, count + 1, 4),
                          5: None,
                          6: lambda: char_check(row, col + 1, char, row, col, count + 1, 6),
                          7: lambda: char_check(row + 1, col - 1, char, row, col, count + 1, 7),
                          8: lambda: char_check(row + 1, col, char, row, col, count + 1, 8),
                          9: lambda: char_check(row + 1, col + 1, char, row, col, count + 1, 9)}

        # Three successful steps from the starting cell means four in a row.
        if count == 3:
            print("Winner: %s" % char)
            sys.exit(0)

        direction_list = []
        direction_list_append = direction_list.append

        # Independent ifs (not elif), so every matching direction is followed
        # from the starting cell, not only the first one that matches.
        if (cell(row, col - 1) == char and
                (row != prev_row or (col - 1) != prev_col) and
                direction in (0, 4)):
            direction_list_append(4)
        if (cell(row, col + 1) == char and
                (row != prev_row or (col + 1) != prev_col) and
                direction in (0, 6)):
            direction_list_append(6)
        if (cell(row + 1, col - 1) == char and
                ((row + 1) != prev_row or (col - 1) != prev_col) and
                direction in (0, 7)):
            direction_list_append(7)
        if (cell(row + 1, col) == char and
                ((row + 1) != prev_row or col != prev_col) and
                direction in (0, 8)):
            direction_list_append(8)
        if (cell(row + 1, col + 1) == char and
                ((row + 1) != prev_row or (col + 1) != prev_col) and
                direction in (0, 9)):
            direction_list_append(9)

        for d in direction_list:
            direction_dict[d]()

    if __name__ == "__main__":
        try:
            # literal_eval only accepts Python literals, unlike eval.
            four_list = ast.literal_eval(sys.stdin.read())
        except (SyntaxError, ValueError):
            print("Error getting list from the web, using preprogrammed backup.")
            four_list = [
                [".", ".", ".", ".", ".", ".", "."],
                [".", ".", ".", ".", ".", ".", "."],
                [".", ".", "O", ".", ".", ".", "."],
                [".", ".", "X", "O", "X", "X", "."],
                [".", ".", "X", "X", "O", "O", "X"],
                [".", ".", "O", "X", "X", "O", "X"]
            ]

        # Start a search from every occupied cell.
        for r, row in enumerate(four_list):
            for col, piece in enumerate(row):
                if piece != ".":
                    char_check(r, col, piece)

        print("No Winner")

Maybe it's my preference for functional programming coming out, but when I looked at this problem I thought “recursion,” remembering from college how much easier it is to solve the Tower of Hanoi problem with recursion than without. The problem became a little more complicated when I went from solving the original example to generating and testing a different board.
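
For comparison, the same search can be written without recursion by walking a fixed set of direction vectors from every occupied cell. This is a sketch of an alternative, not the solution I submitted; find_winner and DIRECTIONS are names made up here, and the board is assumed to be a rectangular list of rows with "." marking empty cells:

```python
# (row step, column step): right, down, down-right, down-left.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def find_winner(board, run=4):
    """Return the piece that has `run` in a row, or None if nobody does."""
    rows = len(board)
    cols = len(board[0]) if rows else 0
    for r in range(rows):
        for c in range(cols):
            piece = board[r][c]
            if piece == ".":
                continue
            for dr, dc in DIRECTIONS:
                end_r = r + (run - 1) * dr
                end_c = c + (run - 1) * dc
                # Explicit bounds check: a negative index would silently
                # wrap around to the other side of the list.
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                if all(board[r + i * dr][c + i * dc] == piece
                       for i in range(run)):
                    return piece
    return None


# The backup board from the solution above.
board = [
    [".", ".", ".", ".", ".", ".", "."],
    [".", ".", ".", ".", ".", ".", "."],
    [".", ".", "O", ".", ".", ".", "."],
    [".", ".", "X", "O", "X", "X", "."],
    [".", ".", "X", "X", "O", "O", "X"],
    [".", ".", "O", "X", "X", "O", "X"],
]

print(find_winner(board))  # O
```

Only four directions are needed because every line of four has a left-most (or top-most) cell, and the scan visits every cell as a possible starting point.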

At the moment I'm trying to figure out a Haskell version of this solution. Hopefully I'll have one soon and I'll update this page when I do.

Before I posted this, I wrote a quick email to the person I spoke to at the Thumbtack booth and sent him my solution. He thanked me for the solution and requested my address to mail a mug to me, which showed up in the mail a couple of days ago. (Below is a pic of the mug filled with a beautiful amber beer.) He also sent me a link to the company's blog post about their experiences using a coding challenge to earn schwag. Here is the link. There are also some pretty impressive solutions to the challenge there, including one done in regular expressions, which deserves a tip of the hat in my book.

PyCon

Hey everybody, I will be attending the US PyCon starting tomorrow. If any of my readers are in town for the convention and are interested, I would happily meet up with you for drinks, food, or whatever. Send me a tweet.

Project Euler: Problem 12

It’s time once again for a favorite blog theme: the Project Euler post. This time around I am answering problem twelve. The website states the problem as:
The sequence of triangle numbers is generated by adding the natural numbers. So the 7th triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. The first ten terms would be:

1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...

Let us list the factors of the first seven triangle numbers:

1: 1
3: 1,3
6: 1,2,3,6
10: 1,2,5,10
15: 1,3,5,15
21: 1,3,7,21
28: 1,2,4,7,14,28
We can see that 28 is the first triangle number to have over five divisors.

What is the value of the first triangle number to have over five hundred divisors?
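
The table quoted above is easy to reproduce by brute force. A small sketch (the helper names are invented for this post, and trial division this naive is only practical for small inputs):

```python
def triangle(n):
    """n-th triangle number: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def divisors(number):
    """All divisors of `number`, found by trying every candidate."""
    return [d for d in range(1, number + 1) if number % d == 0]


# Print the first seven triangle numbers with their divisors.
for n in range(1, 8):
    t = triangle(n)
    print("{}: {}".format(t, ",".join(str(d) for d in divisors(t))))
```

The last line it prints is "28: 1,2,4,7,14,28", the first entry with more than five divisors.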

To spice things up I decided to use a language I haven't used for these in a while, Perl. I also included the usual suspects: Python and Haskell. So, here's the Perl code:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $index = 7;
    my $total = 28;

    # Count divisors by pairing each $i <= sqrt($number) with $number / $i.
    sub divisors
    {
        my ($number) = @_;
        my $sq_n = sqrt($number);
        my $i = 1;
        my $t = 0;

        while ($i <= $sq_n)
        {
            $t += 2 unless ($number % $i);

            $i += 1;
        }

        # The root of a perfect square pairs with itself: count it once.
        my $root = int($sq_n);
        $t -= 1 if $root * $root == $number;

        return $t;
    }

    while( divisors($total) <= 500)
    {
        $index += 1;
        $total += $index;
    }

    print "$total\n";

Nothing really new or interesting to mention in this code. Here is the Python code:

    #!/usr/bin/python

    """
    solution for problem 12 in python.
    """
    import math

    def get_divisors(number):
        tlist = []
        # Include the integer square root itself; stopping one short
        # misses divisor pairs such as 5 * 6 = 30.
        for x in range(2, int(math.sqrt(number)) + 1):
            d, r = divmod(number, x)
            if r == 0:
                tlist.append(x)
                if d != x:  # the root of a perfect square counts once
                    tlist.append(d)

        return len([1, number] + tlist)

    def triangle_nums():
        # Start from the 7th triangle number, 28, as in the problem text.
        iterator = 7
        num = 28

        while True:
            yield num
            iterator += 1
            num += iterator

    if __name__ == "__main__":
        tn = triangle_nums()
        for t in tn:
            tl = get_divisors(t)
            if tl > 500:
                print("num: %d\ncount: %d" % (t, tl))
                break

Pretty standard stuff for the most part. I think the only non-standard thing worth mentioning is the infinite triangle number generator. This took a little finagling, but I got it to work in the end.
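
An alternative worth knowing about: on Python 3, itertools.accumulate over itertools.count gives the same kind of infinite stream without any hand-written bookkeeping. This is a sketch, not the code that was timed for this post, and triangle_numbers is a name made up for the example:

```python
from itertools import accumulate, count, islice


def triangle_numbers():
    """Return a lazy iterator over 1, 3, 6, 10, ... (running sum of 1, 2, 3, ...)."""
    return accumulate(count(1))


# islice takes a finite prefix; the underlying stream never ends.
print(list(islice(triangle_numbers(), 10)))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```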

Here is the Haskell code:

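    module Main where

    -- count divisors in pairs: every x <= sqrt(number) that divides it
    -- has a partner above the square root
    get_div_len :: Integer -> Int
    get_div_len number = foldl1 (+) [2 | x <- [1..limit], number `mod` x == 0]
      where limit = round . sqrt $ fromInteger number

    main :: IO ()
    main = do
        -- skip candidates until one has more than 500 divisors
        print . head $ dropWhile (\x -> fst x <= 500) (map (\x -> (get_div_len x, x)) xs)
      where xs = map (\y -> sum [1..y]) [7..]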

After creating these solutions, I did my usual, highly accurate, testing method to determine the speed of the computation. I was surprised by my results:

Perl: 12.462s
Python: 17.783s
Haskell (compiled): 13.877s

Normally the Haskell solution would be significantly faster, and I have a theory as to why the Haskell times are so close. In Perl and Python I’m doing two additions – one for the increase of the index number and another to increase the total number. In Haskell I’m doing 1 + n additions; the first is to increase the index, and the remaining additions (n) are those used to calculate the sum of all the numbers between (and including) 1 and the index. As the index variable gets larger, that calculation takes more and more time to perform. I would write this up as suboptimal. After spending some time traveling “the tubes,” I discovered the State Monad, which is the reason why this blog post took me so long. I had to spend a week going through random blogs, skimming books, and beating my head against a wall (more than usual) to figure this out.
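
The difference between the two approaches can be sketched in Python. Both functions below produce the same triangle numbers; the first re-adds every term for each index (as sum [1..y] does), while the second carries a running total. The function names are illustrative, and the additions are simply counted by hand:

```python
def triangles_recomputed(limit):
    """Build each triangle number from scratch; returns (values, additions)."""
    values, additions = [], 0
    for index in range(1, limit + 1):
        total = 0
        for term in range(1, index + 1):
            total += term
            additions += 1
        values.append(total)
    return values, additions


def triangles_running(limit):
    """Carry the previous total forward; returns (values, additions)."""
    values, additions, total = [], 0, 0
    for index in range(1, limit + 1):
        total += index
        additions += 1
        values.append(total)
    return values, additions


slow_values, slow_adds = triangles_recomputed(1000)
fast_values, fast_adds = triangles_running(1000)
print(slow_values == fast_values, slow_adds, fast_adds)  # True 500500 1000
```

For the first thousand triangle numbers that is 500,500 additions against 1,000, and the gap keeps widening as the index grows.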

Quick diversion: for those of you who do not know what the State Monad is, let me take a moment to try and explain what it is and why it’s important in this context. Those of us that come from an imperative language (I am one of you in this regard) are used to being able to do a simple addition such as (in pseudo code):

Variable = 2
Variable = variable + 3 or Variable += 3

We can’t do this in Haskell; instead we have to create a new variable name for each new variable assignment. We could also create a function that recursively goes forward, generating the next number in the sequence and bringing our needed variables with us before going deeper down the recursion rabbit hole. With the State Monad, however, we can write our function in such a way that the necessary variables are implicitly passed. Take a look at the new solution to see what I mean:

    {-# LANGUAGE BangPatterns, UnboxedTuples #-}
    module Main where

    import Control.Monad
    import Control.Monad.State

    -- (index, triangle number for that index)
    type MyState = (Int, Int)

    s0 :: MyState
    s0 = (7, 28)

    tick :: State MyState Int
    tick = do
        (n,o) <- get
        let divs = getDivLen (n,o)
        if divs <= 500
            then do
                let n' = n + 1
                let o' = o + n'
                put (n', o')
                tick
            -- the listing broke off after "else"; returning the triangle
            -- number is assumed, since that is the value main prints
            else return o

    getDivLen :: MyState -> Int
    getDivLen (!n, !o) = foldl1 (+) [2 | x <- [1..limit], o `mod` x == 0]
      where limit = round . sqrt $ fromIntegral o

    main :: IO ()
    main = print $ evalState tick s0

The tick function does not have any input parameters. All the information that the function needs comes from the “get” function call, which grabs the current state from the State Monad. If the tick function does not find a number of divisors greater than five hundred, it inserts new values back into the State Monad, and goes down to the next level of recursion.
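
What the State Monad hides is the hand-to-hand passing of the (index, total) pair. Written out explicitly in Python, with a loop standing in for the recursion, the same logic looks roughly like this. It is an analogy rather than a translation; step, count_divisors, and first_triangle_over are names invented here, math.isqrt needs Python 3.8 or newer, and the threshold is a parameter so the example runs quickly:

```python
import math


def count_divisors(number):
    """Count divisors by pairing each d <= sqrt(number) with number // d."""
    root = math.isqrt(number)
    count = sum(2 for d in range(1, root + 1) if number % d == 0)
    # The root of a perfect square pairs with itself: count it once.
    return count - 1 if root * root == number else count


def step(state):
    """Pure transition: take (index, total) and return the next pair."""
    index, total = state
    return (index + 1, total + index + 1)


def first_triangle_over(threshold, state=(7, 28)):
    """Thread the state through step() until the divisor count exceeds threshold."""
    while count_divisors(state[1]) <= threshold:
        state = step(state)  # the job that put does in the Haskell version
    return state[1]


print(first_triangle_over(10))  # 120
```

Every call has to be handed the state and has to hand a new one back; in the Haskell version, get and put do that plumbing behind the scenes.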

It took me a long time to figure this out, mostly because of the lack of examples on the internet concerning the State Monad. If I wanted to create a random number generator I would have been set, but sadly I just wanted to create something that would hold a tuple of numbers and increment them accordingly. So I highly modified one of the “random number generator” examples.

My “highly accurate” speed test result for the new version is:
Haskell (compiled): 2.664s

which is a vast improvement (> 11s) over the previous implementation.

While a Project Euler problem may not have been the best way to learn about using the State Monad, I'm glad I stumbled upon it. I hope that it can be used as an example for others if they want to learn how to use this particular monad to create things other than pseudo-random number generators.

One last thing – some of the brighter crayons in the box (which is most of you, based on the level of comments that I receive) might have noticed that I skipped problem 11. There is a simple response to that. I still haven’t solved it.
