Universal Analytics and user-centric analytics - Why mixpanel and KISSMetrics should worry
NB: The following blog post was initially written as a comment to an article on Google Universal Analytics, the new version of Google Analytics.
A lot has been said about the new Google Universal Analytics. Initially beta-only, the service has been available to all analytics subscribers since March 2013, and reviews and how-tos are now starting to pop up everywhere on the web. One thing that we’ve been particularly interested in reading about UA is Google’s publicized “shift from visit-centric analytics to user-centric analytics”. Comparing Google’s documentation with official videos and comments on the web, it wasn’t entirely clear what that meant and, more specifically, how we were supposed to go about implementing this change. Of course, with the platform not entirely finalized and lots of features still to come, we can only offer hypotheses. If we’re right though, Google’s move may prove to be much more than just an alignment with what companies like mixpanel and KISSMetrics currently offer, and may in fact considerably improve the reliability of user-centric analytics.
Visitor? What visitor?
Analytics is a complicated subject, and for me things got a little clearer when I tried to understand specifically what is commonly referred to as “visitors”. What is it, after all, that analytics services call a visitor? Looking at the evolution of web analytics with that question in mind really helped me gain a better understanding of what UA potentially means for user-centric analytics.
Classic Analytics: visitor = clientID
First there was cookie-based analytics, made vastly popular by Google Analytics (or as they now call it, “Classic Analytics”). With Classic Analytics, “visitors” are in reality devices, or clients (as in client-side): any place where a cookie can be left (typically, a browser). A client hits your site: if it has your cookie, the tracker gets its ID and sends a hit with this ID. If it doesn’t, the tracker creates an ID for it, stores it in a cookie and sends a hit with this ID. All that with the commonly accepted caveat that resetting cookies on a client means it’ll be recognized as an entirely new client from then on. Simple enough.
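The get-or-create logic described above can be sketched in a few lines. This is only an illustration of the principle, not Google's tracker: the cookie jar is a plain dict, and the names `get_or_create_client_id`, `build_hit` and `_client_id` are made up for the example.

```python
import uuid


def get_or_create_client_id(cookies: dict, cookie_name: str = "_client_id") -> str:
    """Return the ID stored in the cookie jar, creating and storing one if absent."""
    client_id = cookies.get(cookie_name)
    if client_id is None:
        # Unknown client: mint a fresh ID and leave it in the cookie.
        client_id = str(uuid.uuid4())
        cookies[cookie_name] = client_id
    return client_id


def build_hit(cookies: dict, page: str) -> dict:
    """Build the payload a tracker would send for one page view."""
    return {"client_id": get_or_create_client_id(cookies), "page": page}


browser_cookies = {}                              # hypothetical cookie jar of one browser
first = build_hit(browser_cookies, "/home")       # new client: an ID is created
second = build_hit(browser_cookies, "/pricing")   # same cookie, same ID
browser_cookies.clear()                           # the user resets their cookies
third = build_hit(browser_cookies, "/home")       # seen as a brand new client
```

The first two hits carry the same ID, while the third one, sent after the cookies were cleared, carries a new one: same person, same browser, two "visitors".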
Mixpanel: visitor = clientID, or userID, or both
Then people started to ask for report consolidation based on users, instead of clients, and we got companies like mixpanel and KISSMetrics. Mixpanel and KISSMetrics have a mixed approach: at the core, they track visitors the good old way, using cookies and client IDs. Each client (as in device, not customer) has an ID, stored in a cookie, and whenever it hits your site the tracker sends a hit with the ID. What mixpanel (and KISSMetrics, but I’ll stick with the former since I know their API better) adds, though, is an additional layer that lets you identify visitors differently. Through cookie-based identification, mixpanel has a way to say ‘I know this client’; through your auth system, you also have a way to say ‘I know this user’: why not use this info as well? As we’ll see, the hard part is how you mix these two sources of identification.
With the alias method, Mixpanel chose to define a visitor as “a clientID, or a userID, or both”. If a visitor is only defined by a clientID, it’ll be identified by mixpanel using standard cookie-based tracking. If it’s only defined by a userID, it’ll have to be identified by you through the mixpanel.identify(userID) method. If it has both a clientID and a userID, it’ll be identified in both situations, i.e. when mixpanel recognizes the corresponding client or when you recognize the corresponding user. Of course, the mixpanel.alias(userID) method is here to associate a client and a user: when you call it, mixpanel will read the clientID from its cookie, get the corresponding visitor and add the userID to it.
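To make that definition concrete, here is a toy model of the visitor store as I understand it from the description above. It is emphatically not mixpanel's implementation or API: the `VisitorStore` class and its methods are hypothetical, and `alias` takes the clientID as an explicit argument because there is no cookie to read it from in this sketch.

```python
class VisitorStore:
    """Toy store where a visitor is a clientID, a userID, or both."""

    def __init__(self):
        self.visitors = []  # each visitor: {"client_id": ..., "user_id": ...}

    def _find(self, key, value):
        for visitor in self.visitors:
            if visitor[key] == value:
                return visitor
        return None

    def by_client(self, client_id):
        """Cookie-based identification: find the visitor for this client, or create it."""
        visitor = self._find("client_id", client_id)
        if visitor is None:
            visitor = {"client_id": client_id, "user_id": None}
            self.visitors.append(visitor)
        return visitor

    def alias(self, client_id, user_id):
        """Attach a userID to the visitor already known through this client."""
        visitor = self.by_client(client_id)
        visitor["user_id"] = user_id
        return visitor

    def identify(self, user_id):
        """Login-based identification: find the visitor for this user, or create it."""
        visitor = self._find("user_id", user_id)
        if visitor is None:
            visitor = {"client_id": None, "user_id": user_id}
            self.visitors.append(visitor)
        return visitor


store = VisitorStore()
anonymous = store.by_client("client-A")        # visitor known by clientID only
store.alias("client-A", "user-42")             # the same visitor now has both IDs
via_login = store.identify("user-42")          # recognized through the userID
via_cookie = store.by_client("client-A")       # recognized through the cookie
user_only = store.identify("user-99")          # visitor known by userID only
```

Here `via_login` and `via_cookie` are the very same visitor record, while `user_only` is a second visitor that has no clientID at all.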
This is where things get tricky. For one, there can’t be 2 [clientID, userID] couples with the same userID: with the way mixpanel does things, this is essentially a technically impossible scenario (using mixpanel.alias(userID) with the same userID on several clients isn’t supported, as indicated in their docs). And yet one user can access your site through different clients, leading to a systematic overestimation of the number of visitors hitting your site. Note that this problem is shared by Google Analytics, and I’m not saying there’s a solution to that - I only think the success of mixpanel at tracking individual users is highly overestimated.
Secondly, and this is also by design, there can’t be 2 [clientID, userID] couples with the same clientID: if you call mixpanel.alias(userID) on the same client with 2 different userIDs, it’ll simply update the visitor with clientID = the ID in the mixpanel cookie twice, each time with the submitted userID. Which leads us to problems with public-access computers, or more generally shared clients.
Maybe more of a minor problem (but still a problem): clients are identified using a cookie-stored ID. What if a user empties their cookies on the client they created their account from? Said client gets a new clientID that’s not aliased, and since I can’t ever alias it (see my first point), non-logged hits from that device are off my radar forever.
Mixpanel’s basic assumption that you can associate a user with a client is simply wrong: not only can the same user access your site through different clients, but several users can also access your site through the same client.
And that’s when Universal Analytics comes in.
Universal Analytics: visitor = … well, we can’t really measure visitors, can we?
The conclusion from the above seems to be that there’s no such thing as a “measurable visitor”. Or at least, not in the sense we usually interpret it, which is “a distinct user, logged in or not, visiting our site”. If you use userID to measure visitors, you’ll ignore all the logged-out hits. If you use clientID, you’ll ignore the fact that users use different clients to access your site and fail to consolidate cross-device data. If you try to use both, you’ll fall into one of the mixpanel traps. So where does that leave us?
This Google I/O video offers clues (jump to minute 12). Google’s answer, it would seem, is for any hit they record, to store the originating [clientID, userID] couple and interpret the data later. Note that in a classic browser scenario clientID will always be set (since it is calculated by analytics.js), while the userID will be set only if you choose to set it (in most cases when users are logged in).
This approach gives Google many advantages: they can now see things like how many different devices each user utilizes on average to access your site (or “device overlap”). Or how many users access your site from shared computers (shared by, on average, how many users). And while we have yet to define what can be considered a visitor in that circumstance, Google can always reconstruct the notion at will from the data: for starters and rather obviously, all [clientID, userID] couples with the same userID would represent the same visitor. They’d then have to interpret [clientID, null] couples which is a little harder, but they could use several strategies (and refine them as they go, since they’ve got the underlying data anyway):
- all [clientID, null] couples with the same clientID could represent the same visitor, which is the standard, Classic Analytics status quo
- or [clientID, null] couples that share their session with a single [clientID, userID] couple could be attached to the corresponding userID
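The two strategies above are easy to compare once the raw couples are stored. The sketch below is purely speculative, like the rest of this section: the hit format, the `session` field and both function names are my own assumptions, not anything Google has documented.

```python
from collections import defaultdict

# Hypothetical raw hits, each stored with its [clientID, userID] couple.
hits = [
    {"session": "s1", "client_id": "c1", "user_id": None},   # before login
    {"session": "s1", "client_id": "c1", "user_id": "u1"},   # after login
    {"session": "s2", "client_id": "c2", "user_id": "u1"},   # same user, other device
    {"session": "s3", "client_id": "c3", "user_id": None},   # never logs in
]


def count_visitors_status_quo(hits):
    """Strategy 1: logged-out hits are grouped by clientID, as in Classic Analytics."""
    keys = set()
    for hit in hits:
        if hit["user_id"] is not None:
            keys.add(("user", hit["user_id"]))
        else:
            keys.add(("client", hit["client_id"]))
    return len(keys)


def count_visitors_session_attach(hits):
    """Strategy 2: logged-out hits join the single user seen in the same session."""
    users_by_session = defaultdict(set)
    for hit in hits:
        if hit["user_id"] is not None:
            users_by_session[hit["session"]].add(hit["user_id"])

    keys = set()
    for hit in hits:
        if hit["user_id"] is not None:
            keys.add(("user", hit["user_id"]))
            continue
        session_users = users_by_session.get(hit["session"], set())
        if len(session_users) == 1:
            # Exactly one user logged in during this session: attach the hit to them.
            keys.add(("user", next(iter(session_users))))
        else:
            # No user, or several: fall back to the clientID.
            keys.add(("client", hit["client_id"]))
    return len(keys)


status_quo = count_visitors_status_quo(hits)
session_attach = count_visitors_session_attach(hits)
```

On this sample the first strategy counts 3 visitors (c1 logged out, u1, c3) and the second counts 2 (u1, c3), since the logged-out hit from c1 gets folded into u1. The point is that both answers come from the same stored data, so the interpretation can change later without re-collecting anything.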
We of course still have to wait and see what the final implementation looks like. Until userID is here, it’s hard to judge whether or not Google succeeded in their attempt to make UA a more user-centric analytics platform. In the meantime, we need to be patient.
One thing in particular that we’ve seen on several sites and that we think is a misuse of UA is the use of clientID as a substitute for the still-missing userID. If our interpretation is correct, clientID is used to identify, well, the client from which hits are sent. Not the user. Even though manually overriding clientID with a backend UUID does have the advantage of consolidating cross-device data, in the process you lose information on the various clients used by your users to access your site, how many they used, if several users used the same client, etc. Using userID in conjunction with clientID will let us have the best of both worlds - in the meantime, well… we can only wait!
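A quick sketch of what gets lost with the override, under the same assumptions as before: hits are made-up (clientID, userID) couples, and the helper names are hypothetical.

```python
from collections import defaultdict


def clients_per_user(hits):
    """Number of distinct clientIDs seen for each userID."""
    clients = defaultdict(set)
    for client_id, user_id in hits:
        if user_id is not None:
            clients[user_id].add(client_id)
    return {user: len(ids) for user, ids in clients.items()}


def users_per_client(hits):
    """Number of distinct userIDs seen on each clientID."""
    users = defaultdict(set)
    for client_id, user_id in hits:
        if user_id is not None:
            users[client_id].add(user_id)
    return {client: len(ids) for client, ids in users.items()}


# Hits recorded with both IDs: alice uses two devices, and shares one with bob.
hits = [("laptop-1", "alice"), ("phone-1", "alice"), ("laptop-1", "bob")]

devices_with_both_ids = clients_per_user(hits)     # {"alice": 2, "bob": 1}
sharing_with_both_ids = users_per_client(hits)     # {"laptop-1": 2, "phone-1": 1}

# Same hits, but with clientID overridden by a backend user identifier.
overridden = [(user_id, user_id) for _, user_id in hits]

devices_after_override = clients_per_user(overridden)   # {"alice": 1, "bob": 1}
sharing_after_override = users_per_client(overridden)   # {"alice": 1, "bob": 1}
```

After the override every user appears to own exactly one client and every client exactly one user: cross-device data is consolidated, but the device overlap and the shared laptop are no longer visible in the data.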