This is a transplant from the original Irrational Exuberance, and was written in mid 2007: nearly two years ago.
To use the internet is to become a number. Not only in the sense of being stripped of distinction and treated (or mistreated) by the invisible hands of machinery, but also in the sense of a tracking number: we are labeled, tracked, and inadvertently leave behind copious amounts of data about ourselves. An entry on the Google Operating System blog mentioned that Google holds 220 terabytes of information from its Google Analytics program. That is second in size only to the database used by its search crawler, which holds 850 TB. (Google Earth uses only 80.5 TB.) At the current rate of growth, the Analytics data will overtake the web crawler's in size. That's a whole lot of data about who has been doing what, and where they did it.

Okay, so that's pretty scary. What do we do about it? Well, we write our own analytics system, of course! Recently I have been building a web analytics kit for use with Django, and early on it had a high potential, like most programming projects, for dying the death of a thousand little cuts. This article looks at the overall design of a web analytics system and how one might be implemented. It will linger briefly on some particularly key details, but won’t attempt to implement a web analytics system in code.

(For those who are curious, I will be brushing up my web analytics system for Django over the next several days and will then release it under an open source license. I won’t discuss it further in this entry, but I intend to write several entries on it (implementation/design, installation, usage, flaws) in the coming days.)
"Why Not Just Use Google Analytics?"
Because this is the sort of response that I stay awake at night dreading, I figured I would preemptively answer this question; someone was going to throw it at me one way or another. My answer consists of two arguments: the first concerns the technical limitations of Google Analytics, and the second concerns the privacy issues raised by the vast scope of the Google Analytics program.

The largest technical limitation I find in Google Analytics is that it updates your statistics only once per day. I frequently wish it would give real-time feedback, and it simply doesn't. This leads into the other technical limitation: it is a proprietary system that I have no access to, so I cannot alter its code to behave how I'd like it to. I can't access the raw data to run my own analysis, and I can't alter the visualizations; I simply have to take it or leave it. Given that I don't care for several of their choices, this isn't a purely ideological objection but also a practical concern. (Many of their choices are a consequence of the scope of their program: they simply don't have the processing power to generate real-time updates, whereas by distributing the computation to local servers we can afford to do what they cannot.)

The privacy concern is also more than ideological pathology; this is a serious threat to privacy. Google Analytics is used on many, many websites (including this one, for the time being), and this, together with its use of third party cookies (discussed in greater depth later), allows Google to track a single user across a vast multitude of sites. Most of us already know that the internet is a lot less anonymous than we'd like it to be, but it can still be disconcerting to realize just how thoroughly our movements are recorded.

The reader may point out that developing yet another web analytics solution hardly seems the way to restore the internet's anonymity, but it can indeed contribute. With numerous disjoint analytics deployments, each tracking visitors to only one specific site, website owners can still maintain site statistics, but each user's behavior is tracked at only one website per deployment. Thus, even if a user is tracked on several different websites, it will be impossible to know that visitor 12312 on site A is the same person as visitor 34455 on site B.

For many, the ease of use of Google Analytics will continue to be the only argument that matters, and I'm fully aware of that. As such, my aim is to refine my analytics kit for Django to the point that it is both more powerful and easier to use than Google Analytics. Perhaps others will follow by implementing similar packages for PHP, Ruby on Rails, and other web development frameworks (why bother dreaming if your dreams are easily attained, right?).

Well, enough with vainly attempting to preemptively thwart criticism, and on to examining the pitfalls and sundry decisions involved in developing a web analytics framework.
Categories of Web Analytic Solutions
1st Party Cookies vs 3rd Party Cookies
Google Analytics can track a user across multiple sites because it uses third party cookies. This means that all sites that use Google Analytics share the same cookie on your computer. Using first party cookies, on the other hand, means that each website has its own cookie: more cookies for the end user, but less potential for tracking user habits across multiple sites.

When implementing a small scale web analytics system, first party cookies are a much easier choice. They have the added benefit of granting your users greater privacy. A win for the implementer, a win for the visitor; who is left to complain?
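As a rough illustration of the first party approach, here is a minimal Python sketch using only the standard library. The cookie name `visitor_id`, the one-year lifetime, and the function `get_or_assign_visitor_id` are assumptions made up for this example; they are not taken from any particular analytics package. The detail that matters is that the cookie is set without a Domain attribute, so the browser sends it back only to the host that issued it.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"          # hypothetical cookie name
ONE_YEAR = 365 * 24 * 60 * 60       # hypothetical lifetime, in seconds


def get_or_assign_visitor_id(cookie_header):
    """Return (visitor_id, set_cookie_value) for one request.

    cookie_header is the raw value of the request's Cookie header
    (or None). set_cookie_value is None for returning visitors and a
    Set-Cookie header value for new ones.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")

    # Returning visitor: reuse the ID this site handed out earlier.
    if COOKIE_NAME in cookies:
        return cookies[COOKIE_NAME].value, None

    # New visitor: mint a random ID that means nothing outside this site.
    visitor_id = uuid.uuid4().hex
    fresh = SimpleCookie()
    fresh[COOKIE_NAME] = visitor_id
    fresh[COOKIE_NAME]["path"] = "/"
    fresh[COOKIE_NAME]["max-age"] = ONE_YEAR
    # No "domain" attribute is set, so the cookie stays with this host.
    return visitor_id, fresh[COOKIE_NAME].OutputString()
```

In a Django view or middleware the same idea would presumably be expressed through the request's cookies and the response's cookie-setting method rather than by building the header by hand.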
It's All About the Benjamins...eh... Data
Last week I dove into web analytics, and it's been a refreshing dip. Like all interesting problems, web analytics has the potential to be very useful, and comes with a number of awkward restrictions that must be worked around. It is (like all internet related fields) still a young field, and it is certainly a field in want of better solutions. As mentioned several times in this article, I have been developing a web analytics system for the Django framework, and it has been a very enlightening experience. It certainly has a long way to go before being as functional as Google Analytics, but it can already do some things Google Analytics can't (for example, update statistics live instead of once per 24 hours), and it is open source, so if you have a problem with it, you can change it to suit your needs. After the issues discussed in this article, I think one can readily imagine two equally rational individuals wanting to handle visitor tracking differently, or wanting to count pageviews by the hour instead of by the day. The beauty of open source is that they can fire up emacs and create their own version, and that is something that Google Analytics will never match.
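To make the hourly versus daily point concrete, here is a small Python sketch of how little needs to change between the two when you own the raw data. The function name `count_pageviews` and the sample timestamps are invented for illustration; this is not code from the Django kit.

```python
from collections import Counter
from datetime import datetime

# Each granularity is just a different way of truncating a timestamp.
BUCKET_FORMATS = {
    "hour": "%Y-%m-%d %H:00",
    "day": "%Y-%m-%d",
}


def count_pageviews(timestamps, granularity="day"):
    """Count pageviews per bucket, where a bucket is an hour or a day."""
    fmt = BUCKET_FORMATS[granularity]
    return Counter(ts.strftime(fmt) for ts in timestamps)


# Made-up pageview times, standing in for rows from a request log.
views = [
    datetime(2007, 6, 1, 9, 5),
    datetime(2007, 6, 1, 9, 40),
    datetime(2007, 6, 1, 17, 0),
    datetime(2007, 6, 2, 9, 0),
]

by_day = count_pageviews(views, "day")
by_hour = count_pageviews(views, "hour")
```

Someone who prefers weekly buckets, or buckets per visitor, would add one more entry or one more grouping key instead of waiting for a vendor to offer it.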