On Google Analytics

Posted by HEx 2014-12-03 at 03:13

On the web, the "traditional" method of visitor tracking is at the origin web server. This is perfectly reliable and trustworthy: it is impossible to visit a site without contacting the origin server1, and no other parties need be involved. Tools such as webalizer perform logfile analysis and produce visitor statistics.2

However, in the past decade a new model of visitor tracking has achieved prominence: third-party tracking via javascript, often with a fallback to transparent 1x1 GIFs ("web bugs"). This model is exemplified by the most widespread tracking tool, Google Analytics.

Google Analytics (GA) is overwhelmingly popular. At the time of writing, approximately half of the web uses GA.3 There is no user benefit to participating in GA tracking, but there are costs both in resource usage and in information leakage. As a consequence, google-analytics.com is also one of the most blacklisted domains on the web.

How blacklisted? Here is a sampling of popular ad-blocking and other privacy-centric browser addons, along with their approximate userbase.

All of the above tools block GA by default. This is not counting other, more niche tools, or the individuals and organisations who block GA manually in their browsers or at their firewalls.

This is not a negligible number of people. Notwithstanding the fact that an awful lot of people seem to not want GA to be a part of their browsing experience, these are all people who will be invisible to GA, and absent in the data it provides. That it is possible (and even desirable) to opt out of third-party tracking undermines the entire concept of third-party tracking.

That Google offers such a deeply flawed service is easy to understand. The question for Google is not whether the data it collects from GA is complete, but whether it gets to collect it at all. Google is the biggest infovore in history: of course websites outsourcing their tracking to Google is a win for Google.

And yet, in part because it is easy to use, GA is incredibly widespread. Given its popularity, GA would perhaps seem to be a win for site operators too. I think not, and here's why: deploying GA sends a message. If a site uses GA, it seems reasonable to conclude the following about the site operator:

  • You care about visitor statistics, but not enough to ensure that they are actually accurate (by doing them yourself).4 Perhaps you don't understand the technology involved, or perhaps you value convenience more than correctness. Neither bodes well.
  • Furthermore, you're willing to violate visitors' privacy by instructing their browsers to inform Google every time they visit your site. Since in return you get only some questionable statistics, it seems visitors' privacy is not important to you.

If you run a website, this is probably not the message you want to be sending.

[1] The increasing use of CDNs to serve high-bandwidth assets such as images and videos does not change the fact that 100% of visits to a modern, dynamic site pass through the origin server(s). Frontend caches such as Varnish merely change the location of logs, not their contents or veracity.

[2] There are very probably more modern server-side tools available nowadays; I've been out of this game for a while now. Certainly CLF leaves a lot to be desired. But since web servers have all the data available to them this is strictly a quality-of-implementation issue.

[3] http://w3techs.com/technologies/overview/traffic_analysis/all claims 49.9%. http://trends.builtwith.com/analytics/Google-Analytics confirms this number, and offers a breakdown by site popularity. Of the most popular 10,000, 100,000 and 1,000,000 sites, the proportion using GA rises from 45% of the top million to nearly 60% of the top ten thousand.

[4] This is an assumption. Deploying GA does not preclude the option of tracking visitors at your web server. But since web server tracking is strictly better than GA, why would anyone bother to use both?

Leave a comment