Haskell

Apache Log IP Counting

It all started with a simple question: “Which single IP shows up the most in an Apache log file?” I ultimately came up with an answer using a string of Bash commands:

 less apache_log | cut -d '-' -f 1 | sort -r | uniq -c

This produced some text displaying how many times various IP addresses made requests to an Apache web server. After seeing the results and thinking for a moment, I remembered that there is a “new” data type (new as of Python 2.7, and new to me at least) called “Counter” within the collections module in the standard library that would allow me to recreate this in Python. So I quickly whipped up the code below:

    #!/usr/bin/env python

    from sys import argv
    from collections import Counter

    if __name__ == "__main__":
        # Count the text before the first "-" on each line (the client IP).
        with open(argv[1]) as f:
            c = Counter(i.partition("-")[0] for i in f)

        # most_common() yields (ip, count) pairs, highest count first.
        for k, v in c.most_common():
            print('{:>4} {}'.format(v, k))

VOILÀ! I get the same results as if I'd done it in Bash. Of course, like the madman I am, I wasn't happy to just stop at Python; I had to write this up in Haskell too. So I said “damn the torpedoes,” fired up my favorite text editor, and started to hack and slash my way toward a third version. After much documentation reading, internet searching, and trial and error I can proudly proclaim, “MISSION ACCOMPLISHED”:

    module Main where

    import System.Environment (getArgs)
    import qualified Data.Text as T
    import qualified Data.Text.IO as TI (readFile)
    import Data.Map (Map, fromListWith, toList)

    type Counter = (T.Text, Int)

    -- Render one (ip, count) pair as "count ip".
    mapUnpack :: Counter -> String
    mapUnpack (k,v) = show v ++ " " ++ T.unpack k

    sortMapByValue :: Map T.Text Int -> [Counter]
    sortMapByValue = rqsort . toList

    -- copied & modified from LYAH's recursion chapter
    -- doing in reverse order as to better suit the output
    rqsort :: [Counter] -> [Counter]
    rqsort [] = []
    rqsort ((k,v):xs) =
        let smallerSorted = rqsort [(k',v') | (k',v') <- xs, v' <= v]
            biggerSorted  = rqsort [(k',v') | (k',v') <- xs, v' > v]
        in  biggerSorted ++ [(k,v)] ++ smallerSorted

    main :: IO ()
    main = do
        results <- getArgs >>= TI.readFile . head
        -- Keep the text before the first dash on each line and drop the spaces.
        let results'  = map (\x -> T.replace space nothing $ head $ T.splitOn dash x) $ T.lines results
        -- Pair every IP with 1 and let fromListWith add up the duplicates.
        let results'' = fromListWith (+) . zip results' $ repeat (1 :: Int)
        -- putStrLn rather than print, so the lines come out without quotes.
        mapM_ (putStrLn . mapUnpack) $ sortMapByValue results''
      where
        dash    = T.pack "-"
        space   = T.pack " "
        nothing = T.pack ""

The Haskell implementation took me longer than I expected as there were a couple of challenges to figure out:
1) Learning the Data.Map datatype and figuring out how to build that datatype from a list of IP addresses.
2) Sorting the Map by value and not by key.
3) Unpacking the Map to be printed.

The first problem was solved by a lucky internet search that showed me an example of how to use the 'fromListWith' function. This function performed the heavy lifting of grouping and counting the IP addresses in the log file.

For the sorting by value instead of by key problem I was able to construct my own solution by tweaking the quicksort example from Learn You a Haskell. The changes I made expanded each element into a (key, value) tuple, allowing the sort to compare the values and put them in descending order.

For the text holding and manipulation, I've been told by other Haskellers that Data.Text is “the way to go” (as the default String implementation is a little slow and lacking features). While Data.Text did provide me an easy way to split the string and grab the IP addresses, I ended up translating each IP address back into a String before printing it. Thus my need for a specific function to create a string based on each item in the Map. In the grand scheme of things having to write the mapUnpack function wasn't horrible... it was just one more hoop that I had to jump through before I could call this project complete. After these modifications I was able to put this program together without TOO much hassle, and in the end got the exact same results as both the Bash string and the Python program.
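To make the fromListWith step and the sort-by-value step concrete, here is a minimal sketch on a hard-coded list of addresses. It swaps the hand-rolled quicksort for sortBy from Data.List with Down from Data.Ord, which is a substitution and not what the program above does. The names countIPs, byCountDesc and sampleIPs, along with the sample addresses, are made up for illustration.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))
import qualified Data.Map as M

-- Count how often each key occurs: every key is paired with 1 and
-- fromListWith (+) adds the 1s together for duplicate keys.
countIPs :: [String] -> M.Map String Int
countIPs ips = M.fromListWith (+) (zip ips (repeat 1))

-- Flatten the Map to (key, count) pairs, largest count first.
-- sortBy/Down stands in for the modified quicksort (an assumption of this sketch).
byCountDesc :: M.Map String Int -> [(String, Int)]
byCountDesc = sortBy (comparing (Down . snd)) . M.toList

-- Hypothetical sample data, not taken from a real log.
sampleIPs :: [String]
sampleIPs = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]
```

Loaded into GHCi, byCountDesc (countIPs sampleIPs) gives [("10.0.0.1",3),("10.0.0.2",2),("10.0.0.3",1)].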

Would I recommend writing these scripts for work instead of using the string of Bash commands? No, especially if you're asked this question in a high-stress situation. However, these little personal challenges were a great way to expand my programming skills, particularly in learning about new libraries like the collections module in Python and the Data.Map module in Haskell. Having this random new knowledge might not seem worthwhile upfront, but might come in handy in the future if I ever encounter a problem that simple strings of Bash commands can't handle.

Project Euler: Problem 12

It’s time once again for a favorite blog theme, The Project Euler post. This time around I am answering problem twelve. The website states the problem as:
The sequence of triangle numbers is generated by adding the natural numbers. So the 7th triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. The first ten terms would be:

1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...

Let us list the factors of the first seven triangle numbers:

1: 1
3: 1,3
6: 1,2,3,6
10: 1,2,5,10
15: 1,3,5,15
21: 1,3,7,21
28: 1,2,4,7,14,28
We can see that 28 is the first triangle number to have over five divisors.

What is the value of the first triangle number to have over five hundred divisors?

To spice things up I decided to use a language I haven't used for these in a while, Perl. I also included the usual suspects: Python and Haskell. So, here's the Perl code:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $index = 7;
    my $total = 28;

    # Count divisors in pairs by trial division up to the square root.
    sub divisors
    {
        my ($number) = @_;
        my $sq_n = sqrt($number);
        my $i = 1;
        my $t = 0;

        while ($i <= $sq_n)
        {
            $t += 2 unless ($number % $i);

            $i += 1;
        }

        # A perfect square's root pairs with itself, so count it only once.
        $t -= 1 if int($sq_n) * int($sq_n) == $number;

        return $t;
    }

    # Walk the triangle numbers until one has more than 500 divisors.
    while( divisors($total) <= 500)
    {
        $index += 1;
        $total += $index;
    }

    print "$total\n";

Nothing really new or interesting to mention in this code. Here is the Python code:

    #!/usr/bin/python

    """
    solution for problem 12 in python.
    """
    import math

    def get_divisors(number):
        tlist = []
        # +1 so that the integer square root itself gets tested
        for x in range(2, int(math.sqrt(number)) + 1):
            d, r = divmod(number, x)
            if r == 0:
                tlist.append(x)
                if d != x:  # don't count a perfect square's root twice
                    tlist.append(d)

        return len([1, number] + tlist)

    def triangle_nums():
        # infinite generator, starting from the 7th triangle number
        iterator = 7
        num = 28

        while True:
            yield num
            iterator += 1
            num += iterator

    if __name__ == "__main__":
        tn = triangle_nums()
        for t in tn:
            tl = get_divisors(t)
            if tl > 500:
                print("num: %d\ncount: %d" % (t, tl))
                break

Pretty standard stuff for the most part. I think the only non-standard thing worth mentioning is the infinite triangle number generator. This took a little finagling, but I got it to work in the end.

Here is the Haskell code:

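    module Main where

    -- Count divisors in pairs up to the square root;
    -- the root of a perfect square is counted only once.
    get_div_len :: Integer -> Int
    get_div_len number = sum [if d * d == number then 1 else 2 | d <- [1..limit], number `mod` d == 0]
      where limit = floor (sqrt (fromInteger number :: Double))

    main :: IO ()
    main = print . head $ dropWhile (\x -> fst x <= 500) (map (\x -> (get_div_len x, x)) xs)
      where xs = map (\y -> sum [1..y]) [7..]  -- every triangle number is summed from scratch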

After creating these solutions, I did my usual, highly accurate, testing method to determine the speed of the computation. I was surprised by my results:

Perl: 12.462s
Python: 17.783s
Haskell (compiled): 13.877s

Normally the Haskell solution would be significantly faster, and I have a theory as to why the Haskell times are so close. In Perl and Python I’m doing two additions – one for the increase of the index number and another to increase the total number. In Haskell I’m doing 1 + n additions; the first is to increase the index, and the remaining additions (n) are those used to calculate the sum of all the numbers between (and including) 1 and the index. As the index variable gets larger, that calculation takes more and more time to perform. I would write this up as suboptimal. After spending some time traveling “the tubes,” I discovered the State Monad, which is the reason why this blog post took me so long. I had to spend a week going through random blogs, skimming books, and beating my head against a wall (more than usual) to figure this out.
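The cost difference described above can be seen in a small sketch. trianglesSlow recomputes each sum from scratch the way the listing above does, while trianglesRunning keeps a running total with scanl1, one addition per term. Both names are invented for this illustration, and scanl1 is offered only as one possible way around the problem; it is not the route taken in the rest of this post.

```haskell
-- Recomputes 1 + 2 + ... + y for every y, so the y-th term costs about y additions.
trianglesSlow :: [Integer]
trianglesSlow = map (\y -> sum [1..y]) [1..]

-- Carries the previous total forward: each new term costs a single addition.
trianglesRunning :: [Integer]
trianglesRunning = scanl1 (+) [1..]
```

In GHCi, take 10 of either list gives [1,3,6,10,15,21,28,36,45,55], the same ten terms quoted in the problem statement.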

Quick diversion: for those of you who do not know what the State Monad is, let me take a moment to try and explain what it is and why it’s important in this context. Those of us who come from an imperative language (I am one of you in this regard) are used to being able to do a simple addition such as (in pseudo code):

Variable = 2
Variable = variable + 3 or Variable += 3

We can’t do this in Haskell; instead we have to create a new variable name for each new variable assignment. We could also create a function that recursively goes forward, generating the next number in the sequence and bringing our needed variables with us before going deeper down the recursion rabbit hole. With the State Monad, however, we can write our function in such a way that the necessary variables are implicitly passed. Take a look at the new solution to see what I mean:

    {-# LANGUAGE BangPatterns #-}
    module Main where

    import Control.Monad.State

    -- (index, triangle number)
    type MyState = (Int, Int)

    s0 :: MyState
    s0 = (7, 28)

    tick :: State MyState Int
    tick = do
        (n,o) <- get
        let divs = getDivLen (n,o)
        if divs <= 500
            then do
                let n' = n + 1
                let o' = o + n'
                put (n', o')
                tick
            else
                -- assumed: hand back the triangle number once it has over 500 divisors
                return o

    -- Count divisors in pairs up to the square root;
    -- the root of a perfect square is counted only once.
    getDivLen :: MyState -> Int
    getDivLen (!_, !o) = sum [if d * d == o then 1 else 2 | d <- [1..limit], o `mod` d == 0]
      where limit = floor (sqrt (fromIntegral o :: Double))

    main :: IO ()
    main = print $ evalState tick s0

The tick function does not have any input parameters. All the information that the function needs comes from the “get” function call, which grabs the current state from the State Monad. If the tick function does not find a number of divisors greater than five hundred, it inserts new values back into the State Monad, and goes down to the next level of recursion.
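For comparison, the same loop can be written without the State Monad by passing the pair along by hand, which is the recursive alternative mentioned earlier. This is only a sketch: search and countDivisors are hypothetical names, and the divisor count is a plain trial-division helper written for this example.

```haskell
-- Count divisors by trial division up to the square root,
-- counting the root of a perfect square only once.
countDivisors :: Int -> Int
countDivisors n = sum [if d * d == n then 1 else 2 | d <- [1 .. limit], n `mod` d == 0]
  where
    limit = floor (sqrt (fromIntegral n :: Double))

-- The (index, triangle number) pair is an explicit argument here;
-- in the State Monad version, get and put carry it for us.
search :: Int -> (Int, Int) -> Int
search threshold (n, o)
  | countDivisors o > threshold = o
  | otherwise                   = search threshold (n', o + n')
  where
    n' = n + 1
```

In GHCi, search 5 (1, 1) returns 28, matching the worked example in the problem statement. Every recursive call has to mention the pair explicitly, which is exactly the bookkeeping that tick leaves to the monad.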

It took me a long time to figure this out, mostly because of the lack of examples on the internet concerning the State Monad. If I wanted to create a random number generator I would have been set, but sadly I just wanted to create something that would hold a tuple of numbers and increment them accordingly. So I highly modified one of the “random number generator” examples.

My “highly accurate” speed test results for the new version are:
Haskell (compiled): 2.664s

which is a vast improvement (> 11s) over the previous implementation.

While a Project Euler problem may not have been the best way to learn about using the State Monad, I'm glad I stumbled upon it. I hope that it can be used as an example for others if they want to learn how to use this particular monad to create things other than pseudo-random number generators.

One last thing – some of the brighter crayons in the box (which is most of you, based on the level of comments that I receive) might have noticed that I skipped problem 11. There is a simple response to that. I still haven’t solved it.

Pangrams


If you haven’t figured it out by now, I enjoy solving problems. And over the course of the last year or two, I’ve learned that interview questions make for great problems to work on. Actual interview problems are nice because they are usually quick, but have a quirk or two in there that makes them challenging, unlike simple questions like the Fizz Buzz problem that just checks if you have the most basic coding skills. (Has anyone actually been asked that question in an interview?)

The most recent problem I got to sink my teeth into (found it on a recruiting site, but not going to share where I got it; wouldn’t be fair to the company posting the problem) is finding pangrams in sentences. If you don’t know what a pangram is, Wikipedia defines it as “a sentence using every letter of the alphabet at least once.” Yeah, I didn’t know what they were either until I started programming this little puzzle. Here is the code:

    module Main (main) where

    import System.Environment (getArgs)
    import qualified Data.Set as S
    import qualified Data.Text as T
    import qualified Data.Text.IO as TI (readFile)

    -- Read the file and normalise each line: drop spaces, lower-case the rest.
    buildList :: FilePath -> IO [T.Text]
    buildList filename = TI.readFile filename >>=
        return . map (T.toLower . T.filter (/=' ')) . T.lines

    -- The letters missing from the sentence, or "NULL" if it is a pangram.
    compareAndPrint :: S.Set Char -> String
    compareAndPrint sset = if S.null result
                               then "NULL"
                               else S.toList result
      where result = S.difference (S.fromList ['a'..'z']) sset

    main :: IO ()
    main = do
        args <- getArgs
        sentences <- buildList $ head args
        mapM_ (putStrLn . compareAndPrint) $ map (S.fromList . T.unpack) sentences

I came up with the solution pretty quickly by using Sets. Having a set of the alphabet and finding the difference of the letters used in the sentence makes the problem almost trivial. The hard part for me was figuring out how to filter out the spaces and change all characters to lower case in the buildList function. I eventually figured it out, but it took some head against wall action to get it right.
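The set-difference idea can also be tried on its own in GHCi. The sketch below works on plain Strings instead of Text, and missingLetters is a hypothetical helper name, not part of the program above.

```haskell
import Data.Char (toLower)
import qualified Data.Set as S

-- Letters of the alphabet that never appear in the sentence, in order.
-- An empty result means the sentence is a pangram.
missingLetters :: String -> String
missingLetters sentence = S.toList (S.difference alphabet used)
  where
    alphabet = S.fromList ['a' .. 'z']
    used     = S.fromList (map toLower sentence)
```

For example, missingLetters "The quick brown fox jumps over the lazy dog" is the empty string, while missingLetters "Hello world" is "abcfgijkmnpqstuvxyz". Spaces and punctuation end up in the used set too, but they fall away in the difference because they are not in the alphabet set.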

This is going to be my last post for this year. I would like to wish you all Happy Holidays and a Happy New Year. Thanks for reading and see you again in 2012. I would also like to thank everyone from planet.haskell.org who decided to read this. Welcome!

Programming Praxis: The Sum of Two Integers

A couple of months ago the Programming Praxis website put up a challenge to find a sum inside an array of integers (the direct wording of the challenge can be found here). Since I came up with my own solution, this little challenge has provided me with a lot of feedback.

Just to get some of the geeky stuff out of the way, here is the code I wrote for the problem:

    import Data.List
    import Data.Maybe

    sumCheck :: Int -> [Int] -> [Int] -> Maybe (Int, Int)
    sumCheck _ [] _ = Nothing
    sumCheck total (x:xs) ys = if total' == Nothing
                                   then sumCheck total xs ys
                                   else return (x, (ys !! ( fromJust total')))
      where total' = (total - x) `elemIndex` ys

In thinking about the problem a little bit I came up with this subtraction approach. My first approach was to use addition and add every item in the array against all the other items. But this method didn’t sit well with me. After a little bike ride I came up with the code you see above.

After I wrote it, I submitted my code to the Haskell-beginners email list asking for critiques and possible enhancements. Arlen Cuss contributed a slight improvement of my code:

    sumCheck total (x:xs) ys =
        let diff = total - x
        in if diff `elem` ys
               then Just (x, diff)
               else sumCheck total xs ys

And Aditya Siram contributed his version, which is basically the first algorithm that I came up with and wanted to improve upon. His code is here:

    sums i as bs = [(x,y) | x <- as, y <- bs, x + y == i]

Finally, Gary Klindt took all of our code snippets, used some performance analysis tools inside GHC, and came up with some measurements that are (hopefully) more accurate than running the time command on an application. Here are those stats:
print $ sumCheck 500 [1..1000] [1..1000]
sumCheck1: 58,648
sumCheck2: 58,484
sumCheck3: 70,016

print $ sumCheck 5000 [1..10000] [1..10000]
sumCheck1: 238,668
sumCheck2: 238,504
sumCheck3: 358,016
(unit: byte)

Out of the three code snippets, my function was in the middle. The figures are in bytes, so strictly speaking they look like the profiler's memory allocation totals rather than elapsed times, but it’s still really nice to see how much better my function does than the regular addition method. It’s also nice to see how the little change made to my code improves on it further.

At the end of the day I take a little bit of pride in myself for coming up with an improved algorithm for this task on my own. Subtraction costs about the same as addition at the hardware level, so the improvement comes from reducing the number of additions and comparisons the function has to make before it is complete. I originally estimated the worst case for my algorithm to be O(n), but since elemIndex is itself a linear search through the second list, it is really O(n × m) for lists of length n and m. The function does stop at the first match it finds, which still isn’t too shabby.
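One way to bring the lookup cost down is to put the second list into a Data.Set first, so each membership test is logarithmic instead of a walk along the list. This is a sketch of that idea and not one of the versions measured above; sumCheckSet is a made-up name.

```haskell
import qualified Data.Set as S

-- Same subtraction idea, but the candidates live in a Set:
-- building it costs O(m log m) and every lookup costs O(log m).
sumCheckSet :: Int -> [Int] -> [Int] -> Maybe (Int, Int)
sumCheckSet total xs ys = go xs
  where
    candidates = S.fromList ys
    go [] = Nothing
    go (x:rest)
      | diff `S.member` candidates = Just (x, diff)
      | otherwise                  = go rest
      where
        diff = total - x
```

In GHCi, sumCheckSet 500 [1..1000] [1..1000] gives Just (1,499), and sumCheckSet 10 [1,2] [3,4] gives Nothing. The trade-off is the up-front cost of building the Set, so for very short lists the plain elem version may well be just as good.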

When I started learning Haskell, one of the things I read on the internet was how the people who programmed it were helpful to one another. I was skeptical when I first read that, but I have to say that all of my doubt has been removed. And it is interactions like this that make me glad to participate in a community as helpful as this one.

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.
