By Thomas Baekdal - July 2019

GDPR: How publishers can track things without tracking people

A few days ago, I wrote a rant on Twitter about the way podcasts work, which I said was out-of-date and idiotic. But as part of that, one of the things I said was that we publishers need a direct way to measure podcast analytics, and my proposed solution was to define an 'analytics ping' URL that every podcast player would activate every five seconds.

Then we have analytics ... and oh boy that sucks. Imagine if we in the podcast RSS could simply define an 'analytics-ping' URL, which would trigger, across podcast players, every five seconds.

The idea here is that instead of having a bunch of totally disconnected and completely useless forms of podcast analytics, as we have today, each publisher could define their own, and have data from everywhere.

As a publisher, of course, this is something that I consider to be essential, but as you can imagine, a number of people expressed outrage at the idea of every podcast player sending out analytics data about their listening habits.

As one put it:

Why would I want to permit that? What makes you think yourself entitled to monitor my private listening habits?

Now, I can understand why people react this way, especially considering the 'normal' practices we see around data in the US. But this person missed two very important things.

First of all, that I'm an EU citizen, which means that I live in the glorious land of EU regulated GDPR. This means that, by law, I'm not entitled to any private data of any kind.

Secondly, what I am talking about has nothing to do with 'private data'.

In fact, there is a perfectly usable way for publishers to track very detailed podcast analytics, even per person, without involving any private data or personal tracking of any kind.

Yes, I know that sounds like a contradiction, but I know this because I have already implemented such a system here on my site.

One example is with my newsletter where I track everything. I track how many people I send it to (obviously); I track when people open it and how many times they open it; the time between opens; what links they click on, and other factors like subscribe and unsubscribe rates.

And yet, my newsletter analytics system does not track any personal information or even do any personal tracking.

Yes, I know that sounds crazy but, in this article, I will explain exactly how that works, and why other publishers should think about doing something similar.

GDPR in the extreme

So, for those of you who don't know the history behind this, let me just very quickly recap.

I'm a media analyst, and part of my job is to detect and predict trends and advise publishers, which means that understanding what GDPR is all about is essential to my work since it has a big impact on how publishers can work with personal data.

So, back when GDPR was about to go into effect, I decided to do something crazy. I decided to take this to the max, and I redesigned everything about my site to not only be 100% GDPR compliant (obviously, since I am legally required to do so), but also to create a site that would put privacy at the front of everything.

What I didn't want is what most other publishers have done, which is to not really change anything, and then just put up one of those really annoying and terribly user-hostile GDPR dialogs that you see everywhere.

As I have written in the past, this is the worst possible thing that any publisher can do, because you are basically telling your readers that you don't give a shit about their privacy. You just want their permission to send personal data to hundreds of unknown and mostly unaccountable 3rd party trackers.

I did not want to do this.

Instead, I wanted to create a site where I wouldn't have to do this, and the way you do that is to remove all forms of personal data from your analytics to begin with.

If you are interested, I detailed all of this in a very lengthy article from last year, where you can read about all the steps I took to make this happen: Inside Story: What I did to get GDPR Compliant

So this is my baseline. I have a site where I'm not only GDPR compliant, I have taken this a step further and made it 'privacy first'. By default, I do not collect any personal data of any kind.

But, of course, as a media analyst, I also love data, and I still wanted to be able to tell how people were using my site. So, I had to come up with a new form of analytics. And how I measure my newsletter is a good example of that.

Newsletters and one-time IDs

So, let's talk a bit about tracking. The way tracking usually works is that you have something in your databases that links the actions people take to each person.

For a normal newsletter system (like what you find at Mailchimp), it usually works something like this:

First you have a database with a list of all the people who have signed up for the newsletter. It might look like this:

What you have is an ID that uniquely identifies each person, the name of the person and their email (all personal information), and what newsletter list they have signed up to.

You might have other fields too, but these are the important ones.

And what most newsletter systems then do is to attach this unique ID to each email, so that whenever you interact with the email, the ID is sent with it.

The result is this:

So here you can see that user '84C59A6E...' (which is me) has opened the email twice (two entries) and clicked on two links the first time and one link the second time.

You can also see that user 'E6BA02DE...' (Hannah) has opened the email once, but didn't click on any links.

And this is how almost all email systems work. There is some form of ID attached to each email that is sent with the information so that the publisher can know who did what and when. And this ID can then be linked between data sets, which is essentially how all forms of tracking work.
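To make the link concrete, here is a minimal Python sketch of that conventional setup. The table layout, the `user_id` field, the short IDs, and the example.com addresses are all illustrative assumptions, not taken from Mailchimp or any real system.

```python
# Hypothetical subscriber table: every row here is personal data.
subscribers = {
    "u-001": {"name": "Thomas", "email": "thomas@example.com", "list": "newsletter"},
    "u-002": {"name": "Hannah", "email": "hannah@example.com", "list": "newsletter"},
}

# Hypothetical event log: each interaction carries the subscriber's ID.
events = [
    {"user_id": "u-001", "action": "open"},
    {"user_id": "u-001", "action": "click"},
    {"user_id": "u-001", "action": "click"},
    {"user_id": "u-001", "action": "open"},
    {"user_id": "u-001", "action": "click"},
    {"user_id": "u-002", "action": "open"},
]


def activity_by_person(subscribers, events):
    """Join the event log to the subscriber table through the shared ID."""
    report = {}
    for event in events:
        person = subscribers[event["user_id"]]
        counts = report.setdefault(person["email"], {"open": 0, "click": 0})
        counts[event["action"]] += 1
    return report


report = activity_by_person(subscribers, events)
```

Because both data sets share `user_id`, a single lookup turns anonymous-looking rows into a per-person activity report. That join is the tracking.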

Of course, in this case, it's an ID, but it could be many other things ... even just your IP address (although that is very unreliable), or, in a more advanced way, some form of 'browser fingerprinting' that identifies what type of computer/browser you are using and then uses that as the link instead.

And if you are a publisher and you do anything like this, GDPR comes into effect, and you need to explicitly get consent to track people this way (which is why all Mailchimp sign-up forms include this exact request).

In other words, the problem here is the link between data sets. That's where the privacy problem comes in.

As I said, I did not want to do this, but I still wanted all the data. So how can I do that?

Well, it's easy... just get rid of the ID that is enabling the link. Like this:

Now we are still collecting all the same data, but because there is no ID, we can no longer trace this activity back to any specific person, and just like that, we have 100% privacy.

But, for those of you who know a bit about analytics and databases, you will also instantly have identified the flaw with this system.

By removing the ID, we also removed the ability to tell people apart, which means that we have no idea if the above is one person visiting three times, two people, or three people ... and from an analytics perspective, this is just completely useless.

From a data perspective, we need to be able to tell people apart, otherwise there is no point in getting this data to begin with.
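A small Python sketch of that flaw, using a made-up event log with the ID column dropped entirely (the values are illustrative assumptions):

```python
from collections import Counter

# Hypothetical event log after removing the ID: only the action remains.
anonymous_events = ["open", "click", "click", "open", "click", "open"]

totals = Counter(anonymous_events)

# The totals are still available (three opens, three clicks), but nothing
# in the data distinguishes one reader opening three times from three
# readers opening once each.
```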

So how do we fix that? How do we track emails 'per person' without also linking it to each actual person?

Well, again, the solution is quite simple. You just create a unique ID linked to each email, but that doesn't link to any one person ... a one-time and one-way ID.

Like this:

Now, as you can see, there is an ID attached to each email, which allows me to divide up all the activity per person, which in this case means that I can see the first two metrics are coming from the same email... which in turn means that this email was opened twice.

But, since this ID is a one-time and one-way ID, it doesn't match with anything in my user database, and as such, I cannot link it to any specific person.

In other words, as a publisher, I am fully aware of how many emails I sent out, how many times they were opened, by how many people, and what each person did.

But my system is built in such a way that I do NOT know who is behind any of the specific metrics. As a result, you get 100% privacy by default, while I, as a publisher, still get the data I need, and on top of that, I don't have to deal with terrible GDPR consent dialogs, because no personal data is tracked at all.
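One way to get an ID with these properties is a random token generated at send time and never stored next to the subscriber. The sketch below assumes exactly that. The article does not say how its IDs are actually generated, and the function names, the event format, and the example.com addresses are hypothetical.

```python
import secrets
from collections import Counter


def prepare_campaign(recipient_emails):
    """Pair each outgoing email with a fresh random token.

    The token is not derived from the address, so it cannot be reversed,
    and only the set of issued tokens is meant to be kept after sending.
    """
    outgoing = [(email, secrets.token_hex(16)) for email in recipient_emails]
    issued_tokens = {token for _, token in outgoing}
    return outgoing, issued_tokens


def summarize(issued_tokens, events):
    """Count sends, unique opens, and total opens using tokens only."""
    opens = Counter(
        e["token"]
        for e in events
        if e["action"] == "open" and e["token"] in issued_tokens
    )
    return {
        "sent": len(issued_tokens),
        "unique_opens": len(opens),
        "total_opens": sum(opens.values()),
    }


outgoing, issued = prepare_campaign(
    ["a@example.com", "b@example.com", "c@example.com"]
)
tokens = [token for _, token in outgoing]
del outgoing  # the email/token pairing is discarded once the mail is sent

events = [
    {"token": tokens[0], "action": "open"},
    {"token": tokens[0], "action": "open"},
    {"token": tokens[1], "action": "open"},
    {"token": tokens[1], "action": "click"},
]
summary = summarize(issued, events)
```

The summary can tell that one email was opened twice and another once, but once the pairing is gone there is nothing left that maps a token back to an address. Generating new tokens for every campaign also means activity cannot be linked from one newsletter to the next.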

We all win!

And this is a fundamental part of all my analytics tracking. For instance, as you have been reading this article, a system has been monitoring your 'read rate' behind the scenes.

As such, I now know that you have reached this part of the article, and how long it took you to get here. But again, this has been implemented using a one-time and one-way ID, so my analytics system has no idea who 'you' are. There is no link or data collection to anything that would be able to track you personally. No ID and not even your IP address.

The point here is that we can actually measure how people interact with our sites without tracking people personally.

So... when the person said to me:

Why would I want to permit that? What makes you think yourself entitled to monitor my private listening habits?

My answer is that I don't. I don't feel entitled to do anything like this at all. But what people need to understand is that measuring analytics and 'monitoring private listening habits' are not the same thing. And here in Europe, it legally cannot be the same thing.

GDPR dictates that personal data must be kept separate from the business: it can only be collected when you have explicit consent, and even then only in a limited way that is necessary for the product or service to be provided.

So as publishers, we must come up with a way to make sure we can still work even without collecting any personal data. But that doesn't mean we can't collect data.

This is the message I'm trying to send. 'Data' and 'violating privacy' are not the same thing.

What are the downsides?

Obviously, what I'm doing here in this site is taking things to extremes, and it does add some limitations.

For instance, when you no longer track people individually, you are also not capable of doing what we call 'profiling'. And while I can see all the critical data about how people read my newsletters, I am not able to track my readers across several campaigns.

Since each email is linked to a one-time ID, the next time I send out an email, it will have a different ID. As such, I'm fully able to tell what is happening within each newsletter, but I have no idea how individual people interact with my newsletters over time.

This creates a problem, for instance, when trying to analyze things like churn. You can still measure churn, of course (and I do), but it takes away the ability to do deep personal analysis of what led to it.

This is a problem.

What's interesting though, is that once you put this limitation on yourself, you realize that there are a lot of other things you can do to get some of those insights back, without ever profiling people.

For instance, you can do pattern analysis instead.

Pattern analysis is where, instead of getting the actual personal data, you just look at the changes in patterns. And by measuring this, you can infer a cause as to what is going on without ever tracking the people who did it.

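You can also track metadata linked to attributes about a person, in a non-identifiable way that doesn't link to who that person is. Apple's 'differential privacy', which adds statistical noise to data so that individual contributions cannot be singled out, is one well-known approach in this family (although there are other forms as well).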

Let me give you a very simple example of this from this site.

I obviously know how many subscribers I have, and my servers also know when you have logged in. I need to know this in order to be able to unlock and show you my Plus articles. But, again, I didn't want my analytics to be able to track you personally.

So how do you do that? Well, it's simple. You record whether an activity is done by a subscriber or not, but you don't record who that subscriber is.

In my analytics system this is measured as a simple number. I have a field in my database that says "userValidated", and if that is a "1", it means that this activity was done by a validated subscriber.

This way, I have the ability to accurately segment my subscribers and my free traffic (which is very important), but I'm not tracking anyone personally.
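As a rough Python sketch of that idea: only the `userValidated` field name comes from the description above. The `record_pageview` function, the `path` field, and the session objects are hypothetical.

```python
def record_pageview(log, path, session_user=None):
    """Append an analytics row that keeps the subscriber flag, not the identity."""
    log.append({
        "path": path,
        # 1 if a logged-in subscriber triggered this activity, else 0.
        "userValidated": 1 if session_user is not None else 0,
    })


log = []
record_pageview(log, "/plus/article", session_user={"id": 42, "email": "reader@example.com"})
record_pageview(log, "/free/article")
record_pageview(log, "/free/article", session_user={"id": 7, "email": "other@example.com"})

subscriber_views = sum(row["userValidated"] for row in log)
free_views = len(log) - subscriber_views
```

The login session is consulted only to decide between 0 and 1. Nothing from it is written to the analytics row, so the segment survives while the identity does not.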

There are a lot of things like this that you can do, where the data that you collect has been processed before you collect it, so that you end up with the information you need to have, without ever needing to record who the person was.

Finally, of course, you can do sampling. And I do this too.

In the digital world, sampling is kind of a new thing, because we have gotten so used to the idea of just measuring people personally, but sampling is a very old concept.

Think about something like TV ratings. The way it works is that you have a somewhat small group of people who have explicitly agreed to have their TV viewing tracked, which you then use as a sample to adjust your larger datasets that don't include any personal information.

Or think about polling. Here you go out and ask maybe 2,000 people about something, and from that you can predict what everyone would do to within a few percentage points.

Wait a minute, you say, polls are not accurate. But actually they are. Pretty much all polls will very accurately predict a result within a few percentage points, but the problem with political polling is that the outcome of an election often falls under that threshold. So political polls often seem inaccurate, but they are not.

For instance, during the 2016 US Presidential election, the difference between the votes the two candidates got in Michigan was roughly 0.2%. So if your poll is accurate to within 3%, it could show either side winning and still be within its margin.

This doesn't mean the poll is inaccurate (it did predict the result within 3%), it's just not very useful when an election is that close.
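The arithmetic behind 'a few percentage points' can be sketched with the standard margin-of-error formula. This assumes a simple random sample, a 95% confidence level, and the normal approximation. Real polls use more elaborate weighting, and the gap used below is only an illustrative value.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


moe = margin_of_error(2000)            # about 0.022, i.e. roughly 2.2 points

close_race_gap = 0.003                 # illustrative: a gap of a few tenths of a point
too_close_to_call = close_race_gap < moe
```

With 2,000 respondents the margin is around two points, which is plenty for most questions but far wider than a gap of a fraction of a point.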

For publishers' analytics, we have no such problem, because we are not trying to measure the outcome of an election.

So sampling can be a very useful tool for publishers, especially when it comes to generalized data (like how many visitors you actually have, or how many times people actually come back).

On this site, the way I'm doing this is that I'm asking a small subset of my subscribers if they will agree to be tracked, and this gives me better insights that I can then use to adjust my larger non-personalized dataset.
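One simple way such an adjustment could work is sketched below in Python. It assumes the anonymous dataset only knows the total number of visits, while the consenting panel reveals how often a typical reader comes back. The function name and the numbers are hypothetical, and the article does not describe the actual calculation used on this site.

```python
def estimate_unique_readers(total_visits, panel_visits_per_person):
    """Scale an anonymous visit count by the return rate seen in an opt-in panel."""
    if not panel_visits_per_person:
        raise ValueError("panel must contain at least one consenting reader")
    avg_visits = sum(panel_visits_per_person) / len(panel_visits_per_person)
    return total_visits / avg_visits


# Visits per person observed among readers who agreed to be tracked.
panel = [1, 2, 3, 2, 2]

# Total visits counted in the larger, non-personal dataset.
estimate = estimate_unique_readers(10_000, panel)
```

The estimate is only as good as the panel is representative, which is the same caveat that applies to TV ratings and polling.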

You can read more about this in my article about how I implemented GDPR.

My point is to illustrate to you as a publisher that there are other ways of measuring analytics, and that you can create a fully working site, where you are measuring how people use it, without jeopardizing people's privacy.

If we look at the trends, this will become vital for the future. And with this article, I hope I have convinced you that there are other ways to track data. I'm not saying that changing this will be easy. But publishers need to wake up to the reality of what people demand today. You are writing about it in your newsroom, but you are not doing any of it yourself.

We need to change how we think about tracking. But most of all, as publishers, we need to be better than the rest of the market when it comes to privacy.

 
 
 


Thomas Baekdal

Founder, media analyst, author, and publisher.

"Thomas Baekdal is one of Scandinavia's most sought-after experts in the digitization of media companies. He has made himself known for his analysis of how digitization has changed the way we consume media."
Swedish business magazine, Resumé

 
