April 3, 2011.
Despite a plethora of options, people keep reimplementing the analytics wheel. There are even, one might venture, probably some good reasons for doing so. But rather than discuss the merits of building analytics systems, I want to discuss dealing with robot and scripting traffic in the homebrew solution you already built.
There's nothing quite like seeing that your site's traffic has doubled over the last few weeks. However, when you're dealing with a freshly rolled-out, hand-rolled analytics system, dig deeper and you're likely to find a few IPs that are performing as many requests as the other ninety-five percent combined. (This is particularly true if your site is large enough to be individually targeted for SEO, social media, or plain old garden-variety spam.)
Fortunately, automated traffic polluting your analytics is relatively straightforward to get under control. The solution depends primarily on whether you are attempting to gather real-time analytics or performing periodic rollups of logs. (Or, mayhaps, some kind of hybrid.)
Generating analytics via periodic rollups is a fairly straightforward process. A simple approach is to find the median requests per IP and treat that median as the maximum number of requests you'll count for any single IP in your rollups. This approach has (at least) two issues: first, it will give an extremely conservative number, and second, it will hide interesting behavior in your most engaged (but still legitimate) users.
If you assume that the number of total unique IPs is much higher than the number of illegitimate IPs (this seems to hold true: the number of pageviews from each of those illegitimate IPs may be remarkably high, but there are relatively few of them), then you can use the more liberal calculation of capping the maximum acknowledged pageviews per IP at mean (rather than median) pageviews plus a standard deviation. This introduces a small amount of deviant behavior, but as you notice how few IPs are generating the vast majority of your automated traffic, you can calculate the expected overage, which will tend to be quite low. (If you're of a slightly lazier bent, you can simply calculate the cutoff once on a representative day and treat it as a constant rather than running the queries to recalculate it each day.)
With that simple change, your analytics have gotten quite a bit more actionable. Sure, some hacker with four AWS instances is using 25% of your site's resources, but at least you are no longer pretending your engagement is skyrocketing.
MySQL, Hive and most other databases you'd be using to store your analytics have functions/UDFs for the mean and standard deviation, so these approaches are pretty straightforward to implement in practice.
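To make the two rollup cutoffs concrete, here is a minimal sketch in Python rather than SQL. The function names and the sample data are hypothetical, and it assumes the population standard deviation (the post doesn't say which flavor to use); treat it as an illustration of the capping idea, not a definitive implementation.

```python
from statistics import mean, median, pstdev


def median_cutoff(views_by_ip):
    """Conservative cap: the median pageviews per IP."""
    return median(views_by_ip.values())


def mean_plus_stddev_cutoff(views_by_ip):
    """More liberal cap: mean pageviews per IP plus one standard deviation.

    Assumption: population standard deviation (pstdev).
    """
    counts = list(views_by_ip.values())
    return mean(counts) + pstdev(counts)


def capped_pageviews(views_by_ip, cutoff):
    """Total pageviews, counting at most `cutoff` views from any single IP."""
    return sum(min(count, cutoff) for count in views_by_ip.values())


# Hypothetical one-day rollup: nine ordinary visitors and one script.
views_by_ip = {
    "10.0.0.1": 3, "10.0.0.2": 5, "10.0.0.3": 2,
    "10.0.0.4": 8, "10.0.0.5": 4, "10.0.0.6": 6,
    "10.0.0.7": 3, "10.0.0.8": 5, "10.0.0.9": 4,
    "10.0.0.10": 960,
}

raw_total = sum(views_by_ip.values())
conservative = capped_pageviews(views_by_ip, median_cutoff(views_by_ip))
liberal = capped_pageviews(views_by_ip, mean_plus_stddev_cutoff(views_by_ip))
print(raw_total, conservative, liberal)
```

With only ten IPs the single script drags the mean and standard deviation up a long way, so the liberal cap still lets a lot of its traffic through; the approach leans on the assumption that legitimate IPs vastly outnumber illegitimate ones. The median cap also trims the legitimate visitor with eight pageviews, which is the "hides your most engaged users" problem in miniature.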
For filtering automated traffic from a real-time analytics system we'll want a similar approach to the one used for periodic rollups, but without the benefit of being able to reason about the complete set of actions a user will perform in a given period. We also don't have the benefit of redoing the calculation later once the period has ended (e.g. after the day ends), which is a strong argument for running a hybrid where real-time data is supplanted by periodic rollups as they become available. (With the hidden cost that the changing numbers would confuse the hell out of most users of your analytics.)
As such, our solution will need to maintain running totals for each metric being tracked, as well as track each IP's activity within a recent period to know when to begin discarding its activity. (It's interesting to look at the Interactive Advertising Bureau's guidelines on the topic, which an old coworker, Anton Kast, pointed out to me as a resource to study.) As automated activity tends to be generated in concentrated bursts, a good place to start is to only count the first action each minute for a given IP.
Such a solution can be implemented using Redis to store the metric counters without expirations, and tracking IP activity in keys which expire a minute into the future. Adjust the approach slightly and you can create more complex solutions (only count fifty actions per IP per day), but in general a simple one-action-per-minute approach should do an adequate job of filtering, and will only require tracking a minute's worth of unique IPs rather than an entire day's.
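The one-action-per-minute logic can be sketched without a Redis server by using an in-memory stand-in. The class and method names below are hypothetical. In a real deployment, the per-IP marker would presumably map to something like Redis's `SET key value NX EX 60` (set only if absent, expire in sixty seconds) and the metric total to `INCR`, with Redis handling expiry for you.

```python
import time


class OneActionPerMinuteCounter:
    """In-memory stand-in for the Redis scheme: metric totals never expire,
    per-IP markers expire `window_seconds` after they are set."""

    def __init__(self, window_seconds=60, clock=time.time):
        self.window_seconds = window_seconds
        self.clock = clock        # injectable so the sketch is testable
        self.totals = {}          # metric name -> running total
        self.ip_expiry = {}       # ip -> timestamp when its marker expires

    def record(self, metric, ip):
        """Count the action unless this IP already acted within the window.

        Returns True if the action was counted, False if it was discarded.
        """
        now = self.clock()
        expires_at = self.ip_expiry.get(ip)
        if expires_at is not None and expires_at > now:
            return False          # marker still live: discard this action
        self.ip_expiry[ip] = now + self.window_seconds
        self.totals[metric] = self.totals.get(metric, 0) + 1
        return True

    def purge_expired(self):
        """Drop stale markers (Redis would do this on its own via key expiry)."""
        now = self.clock()
        self.ip_expiry = {
            ip: t for ip, t in self.ip_expiry.items() if t > now
        }


# Usage with a fake clock, so time can be advanced by hand.
fake_now = [0.0]
counter = OneActionPerMinuteCounter(clock=lambda: fake_now[0])
counter.record("pageviews", "1.2.3.4")   # counted
counter.record("pageviews", "1.2.3.4")   # discarded, same minute
fake_now[0] = 61.0
counter.record("pageviews", "1.2.3.4")   # counted again
```

Note that the window here starts at the first counted action and is not extended by discarded ones, which mirrors a key that simply expires a minute after it is written. Swapping the marker for a counter with a day-long expiry would give the "fifty actions per IP per day" variant.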
How are you filtering automated activity out of your analytics?