Smarter Filtering

At work I have a bunch of scripts that tail logs and pipe STDOUT through a few grep statements, so that I only see the parts I am interested in. It would be nice to do the same thing, but with a filter smart enough to discover interesting lines on its own.

To begin with, I took a small access.log, only 556 lines long. I used a couple of simple grep statements to generate one file with 10 lines that I am interested in and another with 200 lines that I’m probably not interested in. I chose robot as the search term because it should make it easy to find the webcrawlers that are hitting this site.

$ grep robot access.log.2007-02-11 | head -n 10 > watch.txt
$ grep -v robot access.log.2007-02-11 | head -n 200 > ignore.txt

Next, I wrote a real quick script in Ruby that uses a gem called classifier, which provides a naive Bayesian classifier. This is similar to how some spam filters work.

#!/usr/bin/ruby

require 'rubygems'
require 'classifier'
require 'stemmer'

classifier = Classifier::Bayes.new( 'Watch', 'Ignore' )

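# ARGV[0]: file of sample lines that I want to see
IO.foreach( ARGV[0] ) do |line|
   classifier.train 'Watch', line
end

# ARGV[1]: file of sample lines that I want to skip
IO.foreach( ARGV[1] ) do |line|
   classifier.train 'Ignore', line
end

# Print each line from STDIN unless the classifier says to ignore it.
# No eof check here: it would silently drop the last line of input.
$stdin.each_line do |line|
   puts line unless classifier.classify( line ) == 'Ignore'
end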

Here’s how I executed it.

$ cat access.log.2007-02-11 | ruby ./filter.rb watch.txt ignore.txt

It actually worked pretty well. It identified 123 lines of interest in my log file, including the 10 that I used to train it. Most of the lines are from Yahoo Slurp and Googlebot, and only 27 of them have the word robot in them. After a few minutes of analyzing the output, it appears that there were only 11 false positives. Not too shabby for a first try.
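For anyone curious about what a naive Bayesian classifier does with those training lines, here is a rough, self-contained sketch of the general idea. It is not the classifier gem's code: the TinyBayes class, its tokenizer, and the add-one smoothing are all assumptions made for illustration, and it skips stemming and category priors entirely.

```ruby
# Toy naive Bayes classifier, for illustration only.
# TinyBayes is a made-up name; it is NOT how the classifier gem is implemented.
class TinyBayes
  def initialize(*categories)
    # Word counts per category, e.g. @counts["watch"]["googlebot"] => 1
    @counts = categories.to_h { |c| [c, Hash.new(0)] }
    # Total number of words seen per category
    @totals = Hash.new(0)
  end

  # Split a line into lowercase alphanumeric tokens (a simplifying assumption).
  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end

  def train(category, text)
    tokenize(text).each do |word|
      @counts[category][word] += 1
      @totals[category] += 1
    end
  end

  # For each category, sum log P(word | category) using add-one smoothing,
  # so a word never seen in a category does not zero out the whole score.
  def scores(text)
    vocab = @counts.values.flat_map(&:keys).uniq.size
    words = tokenize(text)
    @counts.to_h do |category, counts|
      score = words.sum do |w|
        Math.log((counts[w] + 1.0) / (@totals[category] + vocab))
      end
      [category, score]
    end
  end

  # The category with the highest score wins.
  def classify(text)
    scores(text).max_by { |_, score| score }.first
  end
end
```

Training it looks much like the real script: call train("watch", line) or train("ignore", line) for each sample line, then classify(line) on anything new. The "naive" part is that every word is scored independently of the words around it.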

There are probably more “rubyesque” ways to have pulled it off. I really didn’t like the syntax that I had to use to read in all the lines from STDIN. There are two or three really intuitive ways to do that in Perl, and I just wasn’t able to find one in Ruby tonight.
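On the STDIN question, one candidate is each_line, which any IO object (including $stdin) responds to. The sketch below wraps it in a small helper so it can be tried without a real pipe. The name filter_stream and its block-based interface are made up for this example.

```ruby
# Copy lines from +input+ to +output+, keeping only those the block accepts.
# filter_stream is a hypothetical helper name used for illustration.
def filter_stream(input, output)
  input.each_line do |line|
    # line still carries its trailing newline, so puts does not add another
    output.puts line if yield(line)
  end
end

# With real streams this might look like:
#   filter_stream($stdin, $stdout) { |line| line.include?("robot") }
```

ARGF.each_line is another option, but ARGF reads from any files named in ARGV and only falls back to STDIN when ARGV is empty. A script that takes its training files as arguments would need to shift those off ARGV first.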
