Smarter Filtering
At work I have a bunch of scripts that tail logs and pipe the output through a few grep commands, so that I only see the parts I am interested in. It would be nice to do the same thing, but with a filter smart enough to discover interesting lines on its own.
To begin with, I took a small access.log with only 556 lines in it. I used a couple of simple grep commands to generate one file with 10 lines that I am interested in and another with 200 lines that I’m probably not interested in. I chose the search term robot because it is an easy way to find the web crawlers that are hitting this site.
$ grep robot access.log.2007-02-11 | head -n 10 > watch.txt
$ grep -v robot access.log.2007-02-11 | head -n 200 > ignore.txt
Next, I wrote a quick script in Ruby that uses a gem called classifier. The gem provides a naive Bayesian classifier, which is similar to how some spam filters work.
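#!/usr/bin/ruby
# Usage: ruby filter.rb watch.txt ignore.txt < access.log
require 'rubygems'
require 'classifier'
require 'stemmer'

# Two categories: lines worth watching and lines to ignore.
classifier = Classifier::Bayes.new( 'Watch', 'Ignore' )

# Train on the sample lines of interest (first argument).
IO.foreach( ARGV[0] ) do |line|
  classifier.train 'watch', line
end

# Train on the sample lines to ignore (second argument).
IO.foreach( ARGV[1] ) do |line|
  classifier.train 'ignore', line
end

# Print each STDIN line that is not classified as "Ignore".
# Note: $stdin.eof is true once the final line has been read,
# so the last line of input is never printed.
$stdin.each { |line|
  puts line unless $stdin.eof || ( classifier.classify( line ) == "Ignore" )
}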
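For anyone curious about what a naive Bayesian classifier does with those training lines, here is a minimal sketch in plain Ruby. It is not the classifier gem's implementation: the TinyBayes class, its tokenizer, the add-one smoothing, and the sample log lines are all assumptions made up for illustration. The idea is to count how often each word appears in each category, then pick the category under which a new line's words are most probable.

```ruby
# Illustrative naive Bayes classifier. TinyBayes is a hypothetical class,
# not part of the classifier gem.
class TinyBayes
  def initialize(*categories)
    # @word_counts[category][word] => times the word was seen in that category
    @word_counts = categories.map { |c| [c, Hash.new(0)] }.to_h
    @totals = Hash.new(0) # total words seen per category
    @docs   = Hash.new(0) # training lines seen per category
  end

  def train(category, text)
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @totals[category] += 1
    end
    @docs[category] += 1
  end

  # Returns the category with the highest log-probability score.
  # Assumes train has been called at least once.
  def classify(text)
    vocab_size = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @docs.values.sum.to_f
    words = tokenize(text)

    scores = @word_counts.keys.map do |category|
      # Smoothed log prior: how common the category was in training.
      score = Math.log((@docs[category] + 1) / (total_docs + @word_counts.size))
      words.each do |word|
        # Add-one smoothing keeps unseen words from zeroing out the score.
        seen = @word_counts[category][word]
        score += Math.log((seen + 1.0) / (@totals[category] + vocab_size))
      end
      [category, score]
    end
    scores.max_by { |_, score| score }.first
  end

  private

  # Assumption: lowercase alphanumeric runs are good enough as "words".
  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end

# Made-up log fragments, only to exercise the sketch.
bayes = TinyBayes.new('watch', 'ignore')
bayes.train('watch',  'GET /robots.txt HTTP/1.1 Googlebot')
bayes.train('watch',  'GET /page HTTP/1.1 Yahoo Slurp robot')
bayes.train('ignore', 'GET /index.html HTTP/1.1 Mozilla Firefox')
bayes.train('ignore', 'GET /style.css HTTP/1.1 Mozilla Safari')
puts bayes.classify('GET /news HTTP/1.1 Googlebot')
```

This would also explain how a filter trained only on lines containing robot can flag lines that never use that word: other tokens in the training lines, such as a crawler's user agent, carry weight too.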
Here’s how I ran the script.
$ cat access.log.2007-02-11 | ruby ./filter.rb watch.txt ignore.txt
It actually worked pretty well. It identified 123 lines of interest in my log file, including the 10 that I used to train it. Most of the lines are from Yahoo Slurp and Googlebot. Only 27 of the lines have the word robot in them. A few minutes of analyzing the file, and it appears that there were only 11 false positives. Not too shabby for a first try.
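To put those numbers in perspective, the arithmetic can be worked out in a few lines of Ruby. The counts come straight from the run above; the variable names are just for illustration. Recall cannot be computed this way, since that would require knowing how many crawler lines the filter missed.

```ruby
flagged         = 123 # lines the filter printed
false_positives = 11  # flagged lines that turned out to be uninteresting
contains_robot  = 27  # flagged lines that contain the literal word "robot"

true_positives = flagged - false_positives
precision      = true_positives.to_f / flagged
robot_share    = contains_robot.to_f / flagged

puts format('true positives: %d', true_positives)                  # 112
puts format('precision: %.1f%%', precision * 100)                  # 91.1%
puts format('flagged lines with "robot": %.1f%%', robot_share * 100) # 22.0%
```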
There are probably more “rubyesque” ways to have pulled it off. I really didn’t like the syntax that I had to use in order to read in all the lines from STDIN. There are two or three really intuitive ways to do that in Perl, and I just wasn’t able to find one in Ruby tonight.
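One candidate for a more idiomatic loop is each_line with a do/end block, sketched below. One thing to watch for: Kernel#gets and ARGF read from the files named in ARGV before falling back to STDIN, so a script that takes file names as arguments needs to read from $stdin explicitly. The filter_lines helper and the include? test are hypothetical stand-ins for the classifier call, and StringIO is used only so the example runs without piped input.

```ruby
require 'stringio'

# Hypothetical helper: copy each line from input to output when the
# block returns true. In the real script, input would be $stdin.
def filter_lines(input, output = $stdout)
  input.each_line do |line|
    output.puts line if yield(line)
  end
end

# Stand-in for piped log data.
sample = StringIO.new("a robot was here\nplain browser hit\nlast robot line\n")

# Stand-in predicate; the real filter would ask the classifier instead.
filter_lines(sample) { |line| line.include?('robot') }
```

Because each_line simply stops at end of input, this form needs no eof check, and the final line is handled like any other.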
