Apache Log IP Counting

It all started with a simple question: “Which single IP shows up the most in an Apache log file?” I ultimately came up with an answer using a string of Bash commands:

 less apache_log | cut -d '-' -f 1 | sort -r | uniq -c

This produced some text showing how many times various IP addresses had made requests to an Apache web server. After seeing the results and thinking for a moment, I remembered that there is a “new” data type – new as of Python 2.7, and new to me at least – called “Counter” in the collections module of the standard library that would let me recreate this in Python. So I quickly whipped up the code below:

  #!/usr/bin/env python

  from sys import argv
  from collections import Counter

  if __name__ == "__main__":
      with open(argv[1]) as f:
          c = Counter(i.partition("-")[0] for i in f)
      for k, v in c.most_common():
          print '{:>4} {}'.format(v, k)

VOILA! I get the same results as if I'd done it in Bash. Of course, like the madman I am, I wasn't happy to stop at Python; I had to write this up in Haskell too. So I said “damn the torpedoes,” fired up my favorite text editor, and started to hack and slash my way toward a third version. After much documentation reading, internet searching, and trial and error, I can proudly proclaim, “MISSION ACCOMPLISHED”:

  module Main where

  import System.Environment (getArgs)
  import qualified Data.Text as T
  import qualified Data.Text.IO as TI (readFile)
  import Data.Map hiding (map)

  type Counter = (T.Text, Int)

  mapUnpack :: Counter -> String
  mapUnpack (k, v) = show v ++ " " ++ T.unpack k

  sortMapByValue :: Map T.Text Int -> [Counter]
  sortMapByValue = rqsort . toList

  -- copied & modified from the recursion chapter of LYAHFGG,
  -- sorting in descending order to better suit the output
  rqsort :: [Counter] -> [Counter]
  rqsort [] = []
  rqsort ((k, v):xs) =
      let smallerSorted = rqsort [(k', v') | (k', v') <- xs, v' <= v]
          biggerSorted  = rqsort [(k', v') | (k', v') <- xs, v' > v]
      in  biggerSorted ++ [(k, v)] ++ smallerSorted

  main :: IO ()
  main = do
      results <- getArgs >>= TI.readFile . head
      let results'  = map (\x -> T.replace space nothing $ head $ T.splitOn dash x) $ T.lines results
      let results'' = fromListWith (+) . zip results' $ repeat (1 :: Int)
      mapM_ (putStrLn . mapUnpack) $ sortMapByValue results''
    where dash    = T.pack "-"
          space   = T.pack " "
          nothing = T.pack ""

The Haskell implementation took longer than I expected, as there were a few challenges to figure out:
1) Learning the Data.Map datatype and figuring out how to build that datatype from a list of IP addresses.
2) Sorting the Map by value and not by key.
3) Unpacking the Map to be printed.

The first problem was solved by a lucky internet search that turned up an example of how to use the 'fromListWith' function. That function did the heavy lifting of grouping and counting the IP addresses in the log file.

For the sorting-by-value-instead-of-by-key problem, I constructed my own solution by tweaking the quicksort example from Learn You a Haskell. My changes compare the tuples by their second element (the count) and sort in descending order.

For the text handling involved in unpacking the Map, other Haskellers have told me that Data.Text is “the way to go” (the default String implementation is a little slow and lacking in features). While Data.Text gave me an easy way to split each line and grab the IP address, it also required translating the IP addresses back into a String data type before Haskell would print them. Hence my need for a specific function, mapUnpack, to build a string from each item in the Map. In the grand scheme of things, having to write mapUnpack wasn't horrible...it was just one more hoop to jump through before I could call this project complete.

After these modifications I was able to put the program together without TOO much hassle, and in the end got the exact same results as both the Bash string and the Python program.
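For readers following along in Python rather than Haskell, the counting step that `fromListWith (+)` performs maps closely onto building a dict of running totals. A minimal sketch with made-up IP addresses (not from any real log):

```python
from collections import defaultdict

# Hypothetical sample data -- stand-ins for the first field of each log line.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]

counts = defaultdict(int)           # missing keys start at 0
for ip, n in zip(ips, [1] * len(ips)):
    counts[ip] += n                 # fromListWith (+) merges duplicate keys like this

# Sort by value, descending -- the job rqsort does in the Haskell version.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)                       # [('10.0.0.1', 3), ('10.0.0.2', 1)]
```

The explicit `zip` with a list of 1s mirrors the Haskell `zip results' $ repeat 1`; in everyday Python you would just write `counts[ip] += 1` (or use collections.Counter directly, as above).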

Would I recommend writing these scripts for work instead of using the string of Bash commands? No, especially if you're asked this question in a high-stress situation. However, these little personal challenges were a great way to expand my programming skills, particularly in learning about new libraries like the collections module in Python and the Data.Map module in Haskell. Having this random new knowledge might not seem worthwhile upfront, but might come in handy in the future if I ever encounter a problem that simple strings of Bash commands can't handle.


In your original "shell tools" version, you can replace "less [or cat] | cut" with a quick awk:

 awk '{ print $1 }' access_log | sort -r | uniq -c

Or run the entire pipeline in one awk command:

awk '{ if (!counters[$1]) {counters[$1]=1} else counters[$1]+=1 } END { for (ip in counters) print counters[ip]"\t"ip }' access_log

Notional performance comparison

One of my coworkers was wondering how my suggestions compared regarding performance. Here are some notional performance figures for the same log file, using the shell's time command:

linux> time less apache_log | cut -d '-' -f 1 | sort -r | uniq -c > /dev/null

real 0m1.010s
user 0m0.991s
sys 0m0.019s
linux> time awk '{ print $1 }' apache_log | sort -r | uniq -c > /dev/null

real 0m0.657s
user 0m0.652s
sys 0m0.005s
linux> time awk '{ if (!counters[$1]) {counters[$1]=1} else counters[$1]+=1 } END { for (ip in counters) print counters[ip]"\t"ip }' apache_log > /dev/null

real 0m0.028s
user 0m0.021s
sys 0m0.007s

I only have python 2.4 and 2.6 on this machine, so I can't quickly compare against your Python solution (and apparently I now need to go read up on the collections module).

The other suggestion we had when discussing it was about field-separating on the dash/hyphen. The second field in the Common Log Format is an RFC 1413 ident, "-" if not present. Although I'm not aware of many folks using ident, this value can technically be an ident-returned username. We suggested parsing on whitespace (which is what awk does in my examples), and using split in Python (as opposed to partition); split breaks on whitespace by default. If you really want partition-like behaviour with split, it supports a second maxsplit argument (pass None as the first argument to keep whitespace as the separator). That is, if the variable line holds a line of Apache log, line.split()[0] or line.split(None, 1)[0] gets the IP address.
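To make the split-versus-partition point concrete, here is a quick sketch on a made-up Common Log Format line (the ident field happens to be "-" here, but it could be a username, which is exactly why splitting on "-" is fragile):

```python
# Hypothetical log line -- illustrative data, not from the actual log file.
line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326'

# partition("-") works here only because the ident field is "-":
by_dash = line.partition("-")[0]    # '127.0.0.1 ' (note the trailing space)

# Splitting on whitespace is robust regardless of the ident field:
ip = line.split()[0]                # '127.0.0.1'
ip2 = line.split(None, 1)[0]        # same result, but stops after one split

print(ip, ip == ip2)
```

The `split(None, 1)` form avoids splitting the remainder of the line, which matters on long lines when you only need the first field.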


Hey B00ga,

Thank you for the reply regarding awk. I have to admit I've heard of awk but never used it before. It's also really interesting to see the performance difference between doing everything in awk and piping the results through other Bash commands.

Here are the times for my runs, using the original Bash string as the benchmark:

less apache.log ...  : 0m0.100s
python               : 0m0.523s
haskell (runghc)     : 0m1.929s
haskell (compiled)   : 0m0.021s

use foldl':

 foldl' (\acc x -> Map.insertWith' (+) x 1 acc) Map.empty list


my version:

  import Data.List
  import Data.Function
  import Control.Monad
  import qualified Data.Map as Map
  import qualified Data.ByteString.Char8 as B

  countElements :: Ord a => [a] -> [(a, Integer)]
  countElements list =
      Map.toList $ foldl' (\acc x -> Map.insertWith' (+) x 1 acc) Map.empty list

  logGetIp = map (fst . B.break (== ' '))

  coolShow xs = unlines $ map (\(x, y) -> B.unpack x ++ " - " ++ show y) xs

  main = do list <- B.lines `liftM` B.readFile "super.log"
            putStrLn $ coolShow
                     $ sortBy (compare `on` snd)
                     $ (countElements . logGetIp) list

Timings on a 109MB logfile:

myprog: 0m0.627s
python: 0m0.734s
shell with cat: 0m1.864s
shell with less: 0m2.763s
blog haskell: >20 seconds and stack overflow


On a large file, the "fromListWith" function causes a stack space overflow.

More general log field access in Python

If you want more general access to the fields in Python, I've used something like:

 reader = csv.DictReader(
     (line.replace("[", '"').replace("]", '"') for line in open(sys.argv[1])),
     fieldnames=['ip', None, None, 'date', 'request', 'status code',
                 'size', 'referrer', 'browser'],
     delimiter=" ", quotechar='"')

to get an iterator of dictionaries. You can then use collections.Counter on something like (row['ip'] for row in reader).
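Putting the two pieces together, here is a self-contained sketch of that approach run on a couple of made-up log lines instead of a file (the sample data is illustrative; the field names follow the comment above):

```python
import csv
from collections import Counter

# Hypothetical log lines -- not from the original log file.
lines = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 512 "-" "curl"',
    '10.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /a HTTP/1.0" 200 64 "-" "curl"',
]

# Turning [...] into "..." lets csv treat the timestamp as one quoted field.
reader = csv.DictReader(
    (line.replace("[", '"').replace("]", '"') for line in lines),
    fieldnames=['ip', None, None, 'date', 'request', 'status code',
                'size', 'referrer', 'browser'],
    delimiter=" ", quotechar='"')

counts = Counter(row['ip'] for row in reader)
print(counts.most_common())         # [('10.0.0.1', 2)]
```

The two None entries absorb the ident and authuser fields, which this counting task never looks at.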



When I was coding this, I didn't even think of using the csv module, or realize that I could change the delimiter. Agf, your knowledge of Python's standard library always impresses me.

Thank you for sharing.


Your Haskell code is a bit verbose, mainly because you don't know about the language extension "OverloadedStrings" and you redefine sort:

  {-# LANGUAGE OverloadedStrings #-}
  import System.Environment
  import qualified Data.Text as T
  import qualified Data.Text.IO as T
  import qualified Data.Map as M
  import Data.Function
  import Data.List

  main = do
      [f] <- getArgs
      logs <- T.readFile f
      let sortedIps = sortBy (flip compare `on` snd) . count . extractIps $ logs
      mapM_ showIp sortedIps
    where
      count xs = M.toList . M.fromListWith (+) $ zip xs (repeat 1)
      extractIps = map (fst . T.break (== ' ')) . T.lines
      showIp (ip, c) = T.putStrLn $ T.concat [T.pack (show c), " -> ", ip]

This Counter library seems nice.

Jedai,


Thank you for sharing your rewrite of my program. I can see some of the modifications I could make to render my Haskell code less verbose. I will definitely spend time reading your code and following up on the libraries you used. You're also correct that I didn't know about "OverloadedStrings"; I'll look it up as soon as I finish this comment.

If you made it this far down into the article, hopefully you liked it enough to share it with your friends. Thanks if you do, I appreciate it.
