Despite a plethora of options,
people keep reimplementing the analytics wheel. There are even, one might venture, some good reasons for doing so.
But rather than discuss the merits of building analytics systems, I want to discuss dealing with bot and script
traffic in the homebrew solution you already built.
The Pain of Bots
There’s nothing quite like seeing your site’s traffic has doubled over the last few weeks.
However, when you’re dealing with a freshly rolled-out hand-rolled analytics system,
digging deeper you’re likely to find a handful of IPs performing as many requests as
the other ninety-five percent combined. (This is particularly true if your site is large
enough to be individually targeted for SEO, social media, or plain old garden-variety spam.)
Fortunately, automated traffic polluting your analytics is relatively straightforward to
get under control.
The solution will vary primarily based on whether you are attempting to gather
real-time analytics or performing periodic rollups of logs.
(Or, mayhaps, some kind of hybrid.)
Generating analytics via periodic rollups is a fairly straightforward process.
A simple approach is to
find the median requests per IP and treat that median as the cutoff for requests you’ll use in your rollups.
This approach has (at least) two issues: first, it will give an extremely conservative number,
and second, it will hide interesting behavior of your most engaged (but still legitimate) users.
If you assume that the number of total unique IPs is much higher than the number of
illegitimate IPs (this seems to hold true: the number of pageviews from each of those
illegitimate IPs may be remarkably high, but there are relatively few of them), then you can
use the more liberal calculation of
capping the maximum acknowledged pageviews per IP at mean (rather than median) pageviews plus a standard deviation.
This lets a small amount of automated behavior slip through, but since so few IPs generate
the vast majority of your automated traffic, you can calculate the expected overage, which will
tend to be quite low.
(If you’re of a slightly
lazier bent, you can simply calculate the cutoff once on a representative day and treat it
as a constant rather than running the queries to recalculate it each day.)
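As a sketch, the mean-plus-a-standard-deviation cap might look like the following (the `capped_pageviews` helper name and sample data are illustrative, assuming pageviews have already been aggregated per IP for the period):

```python
import statistics

def capped_pageviews(views_per_ip):
    """Cap each IP's acknowledged pageviews at mean + one standard deviation.

    `views_per_ip` maps IP -> raw pageview count for the period.
    Returns the total pageviews after capping heavy hitters.
    """
    counts = list(views_per_ip.values())
    cutoff = statistics.mean(counts) + statistics.stdev(counts)
    return sum(min(count, cutoff) for count in counts)

# A handful of automated IPs dwarfing the legitimate long tail:
traffic = {"10.0.0.%d" % i: 5 for i in range(95)}
traffic["203.0.113.1"] = 40_000  # scraper
traffic["203.0.113.2"] = 35_000  # scraper
print(capped_pageviews(traffic))  # far below the raw total of 75,475
```

The same cutoff is a one-liner with `AVG` and `STDDEV` in SQL, which is why recalculating it daily (or freezing it as a constant) is cheap either way.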
With that simple change, your analytics have gotten quite a bit more actionable.
Sure, some hacker with four AWS instances is using 25% of your site’s resources, but at least
you are no longer pretending your engagement is skyrocketing.
MySQL, Hive, and most other
databases you’d be using to store your analytics have functions/UDFs for the mean and standard deviation,
so these approaches are pretty straightforward to implement in practice.
For filtering automated traffic from a real-time analytics system we’ll want a similar approach to
the one used for periodic rollups, but without the benefit of being able to reason about the complete
set of actions a user will perform in a given period. We also don’t have the benefit of redoing the calculation later
once the period has ended (e.g. after the day ends), which is a strong argument for running
a hybrid where real-time data is supplanted by periodic rollups as they become available.
(With the hidden cost that the changing numbers would confuse the hell out of most users of your analytics.)
As such, our solution will need to
maintain running totals for each metric being tracked, as well as track each IP’s activity within a recent period to know when to begin discarding its actions.
(It’s interesting to look at the
Interactive Advertising Bureau’s guidelines on the topic,
which an old coworker, Anton Kast, pointed out to me as a resource worth studying.)
As automated activity tends to be generated in concentrated bursts, a good place to start is to
only count the first action each minute for a given IP.
Such a solution can be implemented using Redis to store the metric counters without expirations,
and tracking IP activity in keys which expire a minute into the future. Adjust the approach
slightly and you can create more complex solutions (only count fifty actions per IP per day),
but in general a simple one-action-per-minute approach should do an adequate job of filtering,
and will only require tracking a minute’s worth of unique IPs rather than an entire day’s.
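The one-action-per-minute filter can be sketched as below. To keep the sketch self-contained, a plain dict of expiry timestamps stands in for Redis keys with TTLs (with real Redis, `SET <ip> 1 NX EX 60` for the per-IP key and `INCR` for the counters would cover both pieces); the class and method names are illustrative:

```python
import time

class OneActionPerMinute:
    """Count at most one action per IP per minute."""

    def __init__(self, window=60):
        self.window = window
        self.seen = {}      # ip -> expiry timestamp (Redis: key with a TTL)
        self.counters = {}  # metric -> running total (Redis: INCR, no TTL)

    def record(self, ip, metric, now=None):
        now = time.time() if now is None else now
        expiry = self.seen.get(ip)
        if expiry is not None and now < expiry:
            return False  # already counted this IP within the window
        self.seen[ip] = now + self.window
        self.counters[metric] = self.counters.get(metric, 0) + 1
        return True

tracker = OneActionPerMinute()
tracker.record("203.0.113.1", "pageviews", now=0)
tracker.record("203.0.113.1", "pageviews", now=30)  # ignored: same minute
tracker.record("203.0.113.1", "pageviews", now=90)  # counted: window expired
print(tracker.counters["pageviews"])  # 2
```

Swapping the window and a per-IP count threshold gets you the more complex variants (e.g. fifty actions per IP per day) with the same structure.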
How are you filtering activity out of your analytics?