Examining Web Analytics to Implement (repost)

June 21, 2007. Filed under writing

This is a transplant from the original Irrational Exuberance, and was written in mid 2007: nearly two years ago.

To use the internet is to become a number. Not only a number in the sense of being stripped of distinction and being treated (or mistreated) by the invisible hands of machinery, but we also become numbers in the sense of a tracking number: we are labeled, tracked, and inadvertently leave behind us copious amounts of data about ourselves. In an entry on the Google Operating System blog, the author mentioned that Google has 220 terabytes of information from their Google Analytics program. That is second in size only to the databases used by their search crawler, which holds 850 TB of information. (Google Earth only uses 80.5 TB.) At the current rate of growth, the Analytics data will overtake the web crawler's in size. That's a whole lot of data about who has been doing what, and where they did it.

Okay, so that's pretty scary; what do we do about it? Well, we write our own analytics system, of course! Recently I have been building a web analytics kit for use with Django, and early on it had a high potential--like most programming projects--for dying the death of a thousand little cuts. This article looks at the overall design of a web analytics system, and how one might be implemented. It will linger briefly on some particularly key details, but won't attempt to implement a web analytics system in code.

(For those who are curious: I will be brushing up my web analytics system for Django over the next several days, and will then be releasing it with an open source license. I won't discuss it further in this entry, but intend to write several entries on it--implementation/design, installation, usage, flaws--in the next several days.)

"Why Not Just Use Google Analytics?"

Because this is the sort of response that I stay awake at night dreading, I figured I would preemptively answer this question; someone was going to throw it at me one way or another. My answer consists of two arguments: the first concerns the technical limitations of Google Analytics, and the second the privacy concerns raised by the vast scope of the Google Analytics program.

The largest technical limitation I find in Google Analytics is that it only updates your statistics once per day. I frequently wish it would give real-time feedback, and it simply doesn't. This leads into the other technical limitation: it is a proprietary system that I have no access to, and thus I cannot alter its code to behave how I'd like it to. I can't access the raw data to run my own analysis, I can't alter the visualizations; I simply have to take it or leave it. Given that I don't care for several of their choices, this isn't a purely ideological objection, but also a practical concern. (Many of their choices are the consequence of the scope of their program: they simply don't have the processing power to generate real-time updates, whereas by distributing the computation to local servers we can afford to do what they cannot.)

The privacy concern is also more than ideological pathology; this is a serious threat to privacy. Google Analytics is used on many, many websites (including this one, for the time being), and this--along with its usage of third party cookies, discussed in greater depth later--allows Google to track a single user across a vast multitude of sites. Most of us already know that the internet is a lot less anonymous than we'd like it to be, but it can still be disconcerting to realize just how thoroughly our movements are recorded.

The reader may point out that developing yet another web analytics solution hardly seems the way to restore the internet's anonymity, but indeed it can contribute.
By using numerous disjoint analytics deployments that each track visitors to only one specific site, website owners can still maintain site statistics, but each user's behavior will only be tracked at one website per deployment. Thus, even if a user is tracked on several different websites, it will be impossible to know that visitor 12312 on site A is the same person as visitor 34455 on site B.

For many, the ease of use of Google Analytics will continue to be the only argument that matters, and I'm fully aware of that. As such, my aim is to refine my analytics kit for Django to the point that it is both more powerful and easier to use than Google Analytics. Perhaps others will follow in implementing similar packages for PHP, Ruby on Rails, and other web development frameworks (why bother dreaming if your dreams are easily attained, right?).

Well, enough with vainly attempting to preemptively thwart criticism, and on to examining the pitfalls and sundry decisions involved in developing a web analytics framework.

Categories of Web Analytic Solutions

There are two broad groupings of web analytics implementations. The first consists of server logging software, and software that analyzes the server logs to extract statistics. The second is to have a JavaScript file embedded in your page that transmits data to a server, along with a cookie to identify visitors (this is the approach used by Google Analytics). Both have their flaws: logging software cannot distinguish between multiple visitors with the same IP address, and the embedded JavaScript is thwarted by users with JavaScript disabled, as well as by users who refuse cookies.

Despite its failings, it is much easier to implement the JavaScript/cookie approach than to write your own server, or to analyze server logs (different servers create logs in different formats, which would make analyzing those logs a joyless expedition). (In my implementation I ended up taking something of a hybrid method, but that is a different article, so I must be strong and refrain from discussing it.) As such, most of the following discussion will focus on how to design a JavaScript/cookie methodology.
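To make the JavaScript/cookie approach concrete, here is a minimal server-side sketch written as a plain WSGI application using only the standard library. (The URL parameters, cookie name, and in-memory `hits` list are all my own stand-ins for illustration, not from any released package.) The embedded script requests this endpoint with the page path and referrer as query parameters; the endpoint records the hit, sets a cookie identifying the visitor, and answers with a transparent 1x1 GIF so it can double as an `<img>` beacon:

```python
# Hypothetical tracking beacon for a JavaScript/cookie analytics system.
from http.cookies import SimpleCookie
from urllib.parse import parse_qs
import uuid

# A 43-byte transparent 1x1 GIF, usable as an <img> beacon response.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00"
         b"\xff\xff\xff!\xf9\x04\x01\x00\x00\x00\x00"
         b",\x00\x00\x00\x00\x01\x00\x01\x00\x00"
         b"\x02\x02D\x01\x00;")

hits = []  # stand-in for a real datastore

def track_app(environ, start_response):
    """Record one pageview and answer with a 1x1 GIF."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    # Reuse the visitor's cookie if present, otherwise mint a fresh ID.
    visitor = cookie["vid"].value if "vid" in cookie else uuid.uuid4().hex
    hits.append({
        "visitor": visitor,
        "path": query.get("path", ["/"])[0],
        "referrer": query.get("ref", [""])[0],
    })
    start_response("200 OK", [
        ("Content-Type", "image/gif"),
        ("Set-Cookie", "vid=%s; Max-Age=31536000" % visitor),
    ])
    return [PIXEL]
```

A real deployment would persist the hit to a database rather than a list, but the shape of the request/response cycle is the interesting part: everything the system knows arrives via the query string, the cookie header, and the server's own clock.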

1st Party Cookies vs 3rd Party Cookies

The reason that Google Analytics can track a user across multiple sites is that it uses third party cookies. This means that all sites that use Google Analytics share the same cookie on your computer. Using first party cookies, on the other hand, means that each website has its own cookie--more cookies for the end user, but less potential for tracking user habits across multiple sites.

When implementing a small scale web analytics system, first party cookies are a much easier choice. They have the added benefit of granting your users greater privacy. A win for the implementer and a win for the visitor; who is left to complain?
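The privacy property falls out of the implementation almost for free. Here is a minimal sketch of first-party visitor identification (the function and cookie names are my own invention, not from any released package): because each deployment mints its own random identifier, two sites running the exact same code still cannot correlate a visitor.

```python
# Hypothetical first-party visitor identification sketch.
import uuid

COOKIE_NAME = "visitor_id"  # assumed cookie name, per-site

def get_or_assign_visitor_id(cookies):
    """Given a request's cookie dict, return (visitor_id, is_new_visitor).

    The ID is random per site, so visitor 12312 on site A shares nothing
    with visitor 34455 on site B even if they are the same person.
    """
    existing = cookies.get(COOKIE_NAME)
    if existing:
        return existing, False
    return uuid.uuid4().hex, True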

It's All About the Benjamins...eh... Data

At the end of the day, regardless of other details, web analytics is about collecting information about the traffic to your site. Deciding which data to keep track of is likely the most important decision you will make when designing a web analytics system. The two questions to consider are: what data is useful? and what data is reliable? Unfortunately these two questions can generate very different answers.

We'll start by trying to answer what is useful: we want to know how much traffic each page receives, when we receive traffic, how the traffic finds us, how many visitors we receive, and what the quality of those visitors is. All of this information can be collected fairly easily, but its reliability varies drastically. We can very reliably record which pages receive traffic (hits per page), and how traffic finds us (referrers and search keywords), but tracking and rating visitors is much less precise. We will still want to try to track those statistics, but we have to acknowledge that, regardless of whose analytics we end up using, recording and assessing visitors is far more of an art than a science.

These difficulties result from the fundamental design of the internet. We have relatively meager tools for distinguishing individuals from the mob: we have internet protocol addresses, we have cookies, and there simply isn't anything else (unless we can convince our visitors to install our crappy tracking software on their machines, which, for most of us, is not going to happen). IP addresses change over time and can be shared among multiple individuals; further, they can be spoofed by those who care to try. Cookies are easily deleted, often refused, and fairly unreliable: a user who deletes their cookies will appear to be a new visitor, and a visitor who refuses cookies will always appear to be a new visitor.
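A common fallback when no cookie is available (a hypothetical sketch of a widely used technique, not the method of any particular package) is to hash the IP address and user agent together into a pseudo-identifier:

```python
# Hypothetical cookie-less visitor fingerprint: hash of IP + user agent.
import hashlib

def fallback_visitor_key(ip_address, user_agent):
    """Derive a stable pseudo-identifier when no cookie is present.

    This is simultaneously too strict (one person moving between networks
    looks like two visitors) and too lenient (an office behind one NAT
    with identical browsers looks like one visitor).
    """
    raw = "%s|%s" % (ip_address, user_agent)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]
```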
Some hacks exist, but they are all compromises in one direction or another (too lenient or too strict; perfection will never be the trademark of internet programming).

Likewise, visitor quality is distressingly subjective. Measuring time spent on a site is done by measuring the difference between the last page loaded and the first page loaded within the same browser session. Number of pages viewed per user depends on the accuracy of tracking users (and is thus damned to inaccuracy from the start). Bounce rate (the percentage of users who leave after viewing only one page) is also dependent on accurate tracking of users. Not to hammer the point home too resoundingly, but none of these statistics will ever be particularly accurate. (This is one of the reasons Alexa's solution is somewhat tantalizing: they can accurately track users, and thus they can keep these statistics accurate to a degree that no level of server-side ingenuity can match.)

Acknowledging that much of the data we are tracking is going to be moderately accurate at best, we still have to decide which pieces of information to try to track. In designing my system I settled on these statistics: number of visitors, pageviews per page, hits per referrer, and number of pageviews per period of time. (I also intend to track search engine queries, but need to collect further raw data on identifying and distinguishing queries.) One certainly could keep many more numbers, and might want to--for example--record exactly when each pageview occurred. That seems fairly reasonable, but--at least initially--I wanted to create a system that could produce a summary of its analytics data without doing many database queries and without needing to coerce the data into a meaningful form.

The final decision is how to display the collected data in an easily assimilated format. Google Analytics uses a custom Flash interface, and it is, to be certain, very slick.
Despite this, many of the details of your tracking data are simply unavailable there. Thus, a combination of pleasing graphics and hard numbers seems desirable: something that can fulfill the needs of both an experienced webmaster and a not-quite-savvy newcomer. (Personally I have been taking advantage of the Plotr JavaScript library, and I have been pleased with it.)
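Returning to the visitor-quality statistics above: time on site, pages per session, and bounce rate can all be derived from nothing more than per-session pageview timestamps. Here is a sketch of that derivation (my own illustration, not code from any released package), with the inherent inaccuracy visible right in the arithmetic:

```python
# Deriving "soft" visitor statistics from (session_id, timestamp) pageviews.
from collections import defaultdict

def session_stats(pageviews):
    """pageviews: list of (session_id, unix_timestamp) pairs.

    Returns (avg_time_on_site, avg_pages_per_session, bounce_rate).
    Time on site is last hit minus first hit, so a one-page visit counts
    as zero seconds--exactly the kind of built-in inaccuracy described above.
    """
    sessions = defaultdict(list)
    for session_id, ts in pageviews:
        sessions[session_id].append(ts)
    durations, pages, bounces = [], [], 0
    for times in sessions.values():
        durations.append(max(times) - min(times))
        pages.append(len(times))
        if len(times) == 1:  # single-page session counts as a bounce
            bounces += 1
    n = len(sessions)
    return (sum(durations) / n, sum(pages) / n, bounces / n)
```

For example, one session with hits at t=0 and t=60 plus a second session with a single hit yields an average time on site of 30 seconds and a bounce rate of 50 percent, even though the bounced visitor may have read that one page for ten minutes.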

Ending Thoughts

Last week I dove into web analytics, and it's been a refreshing dip. Like all interesting problems, web analytics has the potential to be very useful, and comes with a number of awkward restrictions that must be worked around. It is (like all internet related fields) still a young field, and it is certainly a field in want of better solutions.

As mentioned several times in this article, I have been developing a web analytics system for the Django framework, and it has been a very enlightening experience. It certainly has a long way to go before being as functional as Google Analytics, but it can already do some things Google Analytics can't (for example, update statistics live instead of once per 24 hours), and it is open source, so if you have a problem with it, you can change it to suit your needs. After discussing some of the numerous issues in web analytics in this article, I think one can readily imagine how two equally rational individuals might want to handle tracking visitors differently, or may want to count pageviews by the hour instead of by the day. The beauty of open source is that they can fire up emacs and create their own version, and that is something that Google Analytics will never match.