Noise or Music? - The Insights Blog

Remove PII from Google Analytics

Categories: Google Analytics and GTM, Privacy and Accuracy / Comments: 26

This is my extension to the GTM Tips post by the excellent Simo Ahava (his post: Remove PII From Google Analytics Hits).

Essentially, I had been looking for a way to block Personally Identifiable Information (PII) at the collection level, i.e. using GTM, before the hit is sent to Google Analytics. Why do this? Putting the obvious requirement not to gather PII to one side, if you are adding filters to your GA Views in order to delete PII, it is too late – the problem has already occurred. That is, if you have already sent the personal data to Google, then you have already breached the GDPR!

Previously, by using GTM I would simply drop any hits containing page URLs with an @ symbol i.e. in case the URL contained an email address. Apart from being quite blunt (not all URLs with an @ symbol contain an email address), this approach would not tackle email addresses being present in other hit types e.g. events, e-commerce data etc. It also did not tackle other PII types – such as telephone numbers, zip codes, usernames etc. Hence, the much better approach of Simo’s method – using GTM’s new customTask feature – was very interesting to me!

In this post, I extend his method by building out the regex more – for a more sophisticated email detection, and to capture other PII types…

Redact, rather than remove PII

The important thing here is to remember we are redacting the PII – not blocking or removing it. This is an important distinction. If PII is present, it is almost certain that the same PII is being logged elsewhere on your network – your web server logfile at the very least. Reporting this in your Google Analytics in redacted form means you have a monitoring system that flags the issues for your web dev/IT team to fix and keep on top of. Essentially, to be compliant, PII issues need to be fixed at their source by your organisation. Alternatively, if you deleted the PII data from your reports or simply stopped collecting it in GA, you would metaphorically be sweeping the problem under the carpet.

Here is my adjusted code for your Custom JavaScript variable.

IMPORTANT: This is a straight replacement for Simo’s code. Replace domain\.com with the domain of your website (lines 6 and 10). More on what this is for later. As always, when working with code it’s up to you to test it and ensure it works correctly. No liability accepted!

function() {
  return function(model) {
    // Add the PII patterns into this array as objects
    var piiRegex = [{
      name: 'EMAIL',
      regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi,
      group: ''
    },{
      name: 'SELF-EMAIL',
      regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi,
      group: ''
    },{
      name: 'TEL',
      regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi,
      group: '$1'
    },{
      name: 'NAME',
      regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi,
      group: '$1'
    },{
      name: 'PASSWORD',
      regex: /((password=)|(passwd=)|(pass=))[^&\/\?]+/gi,
      group: '$1'
    },{
      name: 'ZIP',
      regex: /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi,
      group: '$1'
    }

    ];
    // Fetch reference to the original sendHitTask
    var originalSendTask = model.get('sendHitTask');

    var i, hitPayload, parts, val;

    // Overwrite sendHitTask with PII purger
    model.set('sendHitTask', function(sendModel) {
      hitPayload = sendModel.get('hitPayload').split('&');
      for (i = 0; i < hitPayload.length; i++) {
        parts = hitPayload[i].split('=');
        val = decodeURIComponent(unescape(parts[1]));
        piiRegex.forEach(function(pii) {
          val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
        });
        parts[1] = encodeURIComponent(val);
        hitPayload[i] = parts.join('=');
      }
      sendModel.set('hitPayload', hitPayload.join('&'), true);
      originalSendTask(sendModel);
    });
  };
}

Update: If you are having issues with encoded characters displaying in your Google Analytics reports for URLs and/or page titles, try changing line 41 of the JavaScript code (the line that assigns val from parts[1]) to the one below:

val = decodeURIComponent(decodeURIComponent(parts[1]));
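One thing to watch with either version of that line: decodeURIComponent throws a URIError when a value contains a % that is not part of a valid escape sequence (a page title such as "100%" once decoded, for example), and an uncaught error inside the task would stop the hit from being sent. Below is a minimal sketch of a more defensive decode. The helper name safeDecode and the maxPasses parameter are my own assumptions for illustration; they are not part of Simo's code or the variable above.

```javascript
// Hypothetical helper (an assumption, not part of the original variable):
// decode a payload value up to `maxPasses` times, stopping as soon as
// decoding fails or no longer changes the string.
function safeDecode(value, maxPasses) {
  var decoded = value;
  for (var pass = 0; pass < maxPasses; pass++) {
    var next;
    try {
      next = decodeURIComponent(decoded);
    } catch (e) {
      // Malformed escape sequence (URIError): keep what we have so far
      break;
    }
    if (next === decoded) {
      // Nothing left to decode
      break;
    }
    decoded = next;
  }
  return decoded;
}

console.log(safeDecode('%252Ftest%253Fa%253Db', 2)); // double-encoded path
console.log(safeDecode('100%', 2));                  // stray % is left alone
```

If you wanted to try this in the variable, the assignment would become val = safeDecode(parts[1], 2); as always, test it against your own hits first.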

Once you add this variable to your Universal Analytics tags as the customTask field, any hits sent by these tags will be parsed by this variable, which replaces the instances of PII with the string [REDACTED pii_type].

For example, a URL with path:

/test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello

would be replaced with:

/test?tel=[REDACTED TEL]&email=b[REDACTED EMAIL]om&other=bcli[REDACTED SELF-EMAIL]IN.com&firstName=[REDACTED NAME]&password=[REDACTED PASSWORD]
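If you want to check the behaviour outside GTM, here is a minimal sketch that runs the same split, decode, replace and re-encode steps on a hand-made payload string. The function name redactPayload and the sample payload are assumptions for illustration only: real hits carry many more parameters, I have included just four of the patterns, and I use a single decodeURIComponent for simplicity. Note that the label in the output comes from the name key, so telephone numbers are reported as [REDACTED TEL].

```javascript
// Same pattern objects as in the variable (shortened to four entries).
var piiRegex = [
  { name: 'EMAIL', regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'SELF-EMAIL', regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'TEL', regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi, group: '$1' },
  { name: 'NAME', regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi, group: '$1' }
];

// Hypothetical stand-alone version of the loop inside sendHitTask.
function redactPayload(payload) {
  return payload.split('&').map(function(pair) {
    var parts = pair.split('=');
    // Decoding the value restores the full URL, query string included
    var val = decodeURIComponent(parts[1]);
    piiRegex.forEach(function(pii) {
      val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
    });
    parts[1] = encodeURIComponent(val);
    return parts.join('=');
  }).join('&');
}

// Made-up payload: the page path travels URL-encoded in the dp parameter,
// so its own & and = characters do not interfere with the splitting.
var path = '/test?tel=+44012345678&email=brian@me.com&firstName=brian';
var payload = 'v=1&t=pageview&dp=' + encodeURIComponent(path);

var redacted = redactPayload(payload);
var redactedPath = decodeURIComponent(redacted.split('&')[2].split('=')[1]);
console.log(redactedPath);
// → /test?tel=[REDACTED TEL]&email=b[REDACTED EMAIL]om&firstName=[REDACTED NAME]
```

The point to notice is that each regex sees the whole decoded URL, including the parameter names, which is why the anchored patterns (tel=, firstname= and so on) can match.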

The Regex Changes Explained

-Extending the Email regex

For the EMAIL check, I make two changes to Simo’s original regex:

regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi,

Firstly, this matches four consecutive characters that are not a forward slash (/), followed by @. Then, so long as the @ is not immediately followed by domain.com, it matches the next four characters, which again must not include a forward slash.

So apart from looking for an email address, I am doing two extra things:

1. I exclude any “innocent” links that may be captured as outbound links containing an @. Common examples are Google Maps and Flickr links, which contain a forward slash – the [^\/] part. Example links:

  • www.google.com/maps/place/University+of+San+Francisco+-+Folger+Bldg,+101+Howard+St,+San+Francisco,+CA+94105/@37.7908871,-122.3925594,17z/data=!3m1!
  • www.flickr.com/photos/123456@N06/sets/721576344/

2. I exclude the domain of the website itself from this check using a negative look ahead – the (?!….) part. Remember to replace domain\.com with your own domain e.g. brianclifton\.com in my case. I match for this separately next.
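As a quick sanity check of both points, the sketch below runs the EMAIL pattern against a few sample strings. The sample strings are made up (the Google Maps link is a shortened version of the one above), and I have dropped the g flag here only so that test() can be called repeatedly without carrying state between calls.

```javascript
// The EMAIL pattern from the variable, without the g flag.
var emailRegex = /[^\/]{4}@(?!domain\.com)[^\/]{4}/i;

var samples = [
  // A slash sits directly before the @, so the first [^\/]{4} cannot match
  'www.google.com/maps/place/University+of+San+Francisco/@37.7908871,-122.3925594,17z',
  // A slash appears within four characters after the @
  'www.flickr.com/photos/123456@N06/sets/721576344/',
  // A visitor's email address in a query parameter
  '/thanks?email=simo@hissite.com',
  // The site's own domain, skipped by the negative look ahead
  '/contact?to=info@domain.com'
];

var results = samples.map(function(sample) {
  return emailRegex.test(sample);
});

console.log(results); // → [ false, false, true, false ]
```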

My suggestion for a separate regex is to catch and redact any payloads containing the SAME email domain as the site itself, with a different “name” value to the regular email redaction. That way such emails will be reported differently in Google Analytics, allowing the site owner to ignore these and monitor real PII infringements.

For example:

  • If a visitor comes to my site and I capture their email address, e.g. simo@hissite.com, the redaction message is [REDACTED EMAIL]
  • If a visitor comes to my site and I capture my own email address as an outbound click-through to the site owner, e.g. mysite@brianclifton.com, the redaction message is [REDACTED SELF-EMAIL]

As the site owner, the first message is the one I should be paying attention to. The second message (not really PII as it belongs to the site owner) keeps me compliant with Google’s terms of service.

For the SELF-EMAIL check, the regex is almost identical:

regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi,

The difference now is that I do wish to include my own domain in the match and this is achieved via a positive look ahead – the (?=….) part.
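Since the two patterns differ only in the look ahead, you could generate both from a single domain string rather than editing two regexes by hand. This is a minimal sketch under my own assumptions: buildEmailPatterns and redactEmails are hypothetical helper names, not part of the variable above.

```javascript
// Hypothetical helper: build the EMAIL and SELF-EMAIL pattern objects
// from a plain domain string such as 'brianclifton.com'.
function buildEmailPatterns(domain) {
  // Escape regex metacharacters so that '.' matches a literal dot
  var escaped = domain.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return [
    // Negative look ahead: any domain except our own
    { name: 'EMAIL', regex: new RegExp('[^/]{4}@(?!' + escaped + ')[^/]{4}', 'gi'), group: '' },
    // Positive look ahead: only our own domain
    { name: 'SELF-EMAIL', regex: new RegExp('[^/]{4}@(?=' + escaped + ')[^/]{4}', 'gi'), group: '' }
  ];
}

// Apply each pattern in turn, exactly as the variable does
function redactEmails(text, patterns) {
  patterns.forEach(function(pii) {
    text = text.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
  });
  return text;
}

var patterns = buildEmailPatterns('brianclifton.com');

console.log(redactEmails('mailto:simo@hissite.com', patterns));
// → mailto:[REDACTED EMAIL]ite.com
console.log(redactEmails('mailto:mysite@brianclifton.com', patterns));
// → mailto:my[REDACTED SELF-EMAIL]nclifton.com
```

Because the look ahead only checks what immediately follows the @, any address whose domain merely starts with your domain string would also be labelled SELF-EMAIL; tighten the pattern if that matters for your site.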

-Extending the regex to capture other PII

The original post by Simo was a simple pattern match – easy to use and maintain when you know the structure of the match you are looking for e.g. an @ symbol to match email addresses, or a well structured set of characters and numbers for strings like personal ID and social security numbers. However, I want to extend this to match less structured PII, for example people’s names, addresses, telephone numbers, zip codes etc.

To do this, we need a regex anchor. That is, a common string likely to contain such PII. I am assuming all such matches are contained within URL strings as query parameters (though name=value pairs in the URL path are also matched) e.g.

/test?tel=+46(0)12398765&firstname=Brian&zip=abc123

The anchor is the query name and we match for common PII culprits – these are tel, firstname and zip in my example. Of course these should be adjusted for your particular language. Anchors are the reason why the group key is required:

name: 'ZIP',
regex: /((postcode=)|(zipcode=)|(zip=))[^\/\?&]+/gi,
group: '$1'

In this case, $1 is the value of the string (our anchor) just before and including the = sign. We keep this in place for the data hit, and redact what follows. Without applying the grouping, the entire name=value pair would be redacted making troubleshooting difficult. I use [^&\/\?] in order to conclude the match within paths, or query parameters…
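To see what the group key buys you, here is a small sketch that applies the ZIP pattern with and without the $1 back-reference. The sample URLs are made up for illustration.

```javascript
var zipRegex = /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi;
var queryUrl = '/test?firstname=Brian&zip=abc123';
var pathUrl = '/signup/zip=abc123/done';

// With group '$1' the matched anchor (here "zip=") is written back
var withGroup = queryUrl.replace(zipRegex, '$1' + '[REDACTED ZIP]');

// With an empty group the anchor is swallowed along with the value
var withoutGroup = queryUrl.replace(zipRegex, '' + '[REDACTED ZIP]');

// [^&\/\?]+ stops at the next slash, so name=value pairs in paths work too
var inPath = pathUrl.replace(zipRegex, '$1' + '[REDACTED ZIP]');

console.log(withGroup);    // → /test?firstname=Brian&zip=[REDACTED ZIP]
console.log(withoutGroup); // → /test?firstname=Brian&[REDACTED ZIP]
console.log(inPath);       // → /signup/zip=[REDACTED ZIP]/done
```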

Happy compliance testing 🙂

BTW, you do know I am building a data auditing and compliance tool to measure and monitor Google Analytics data quality, right?

[Thank you to the equally excellent David Vallejo for his JavaScript help – my skills are simply too rusty!]

Comments (most recent first)

  1. nicolaos says:

    Thanks for this awesome post! Quick question: why not treat the EMAIL query like all the others, such as firstname and telephone?
    In other words, why not remove the whole value of the email query instead of just the 4 characters before and after the @?

    Thanks

    • Thanks for the feedback. The simple answer is because you can i.e. there is a nice anchor (the @) to use that other fields don’t have. That way, the owner of the data is able to understand exactly what email addresses are being captured. For example, it could be legitimate mailto links to resellers i.e. not really PII. There is just no way to do this with other potential PII.

  2. Ankit says:

    Great post.
    Just one simple question.
    Do you know how I can use customTask for ‘AdWords Remarketing’ tags (to redact PII)? I use it very well for ‘Universal Analytics’ tags.

  3. Marco says:

    Hi Brian,

    Thank you very much for posting this useful information.

    I am quite interested in implementing this mechanism on my website, which is coded in PHP instead of JavaScript. Do you know if this is a limitation for implementing GTM (I think the code snippet is only available in .js?) and consequently your custom JavaScript variable?

  4. Travis B says:

    Any way to go about doing this in GA without using Tag Manager?

    • You really want to do this at the point of data collection i.e. via GTM (or other tag manager solution) and not once the data is already in your GA account. Essentially, once you have collected the PII data you have already broken data privacy laws – regardless of where it is then stored/processed.

      However, note this technique is a monitoring system for you – using GA to redact and then flag up PII issues with your website. Ultimately these need to be flagged to the IT/Web Dev team to sort out the underlying issue, as even if the data does not get into GA it is very likely to be in web server log files, router log files, firewall log files etc.

      • Travis B says:

        Thanks Brian! I’m actually having to use analytics.js just for the sake of the way our e-commerce tracking is set up. I found out I can load analytics.js and then the ecommerce.js plugin without sending a page view, and then use GTM for page views and setting up this PII flagging. Great post! Thanks for the response.

  5. Matt says:

    Thanks for sharing! I really like the idea of capturing more types of PII. However, it seems like this code snippet only works for the EMAIL and SELF-EMAIL patterns. For the other types of PII, it seems like the regex pattern isn’t getting applied to the right piece.

    It looks like the line
    ```
    hitPayload = sendModel.get('hitPayload').split('&')
    ```

    breaks the URL into an array of strings, removing and splitting it at any `&`’s.

    So, a starting URL of:
    ```
    /test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello
    ```

    Would look like this as a JavaScript array:
    ```
    [
    "/test?tel=+44012345678",
    "email=brian@me.com",
    "other=bclifton@DOMAIN.com",
    "firstName=brian",
    "password=hello"
    ]
    ```

    You then cycle through this array with a for loop:
    ```
    for (i = 0; i < hitPayload.length; i++) {

    }
    ```

    Inside the for loop, you break each string into another array of strings by the `=`:
    ```
    parts = hitPayload[i].split('=');
    ```

    So a piece of the larger array like:
    ```
    "firstName=brian"
    ```

    Becomes:
    ```
    ["firstName", "brian"]
    ```

    You then decode and assign to `val` the second item in the array.
    ```
    val = decodeURIComponent(unescape(parts[1]));
    ```

    Here’s what looks like a problem to me: only "val" gets the regex pattern applied to it:
    ```
    piiRegex.forEach(function (pii) {
    val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
    });
    ```

    The code seems to be cycling through the PII regex patterns and applying both the pattern and replacement only to the values of any query parameters, but not the query. The anchor needs to be searching for terms like `firstName`, but I think it’s only seeing values like "brian" all the way down.

    You then encode the value again, join each part back by the =, and join everything back up into one url with the &.
    ```
    parts[1] = encodeURIComponent(val);
    hitPayload[i] = parts.join('=');
    } // end for loop
    sendModel.set('hitPayload', hitPayload.join('&'), true);
    ```

    Am I missing something? It seems like the regex pattern is only getting applied to the second part of `parts` or the value in the query parameter (e.g. "brian"), but never the full query parameter (e.g. "firstName=brian")

  6. Simms says:

    Great post but one small question.

    Is it possible to just make this apply to GA and not our own software system. Is this code a blanket block for all tags or could some tags be on. White list?

    • Hello @simms – this method uses a Custom JS variable that is applied via Google Analytics’ new customTask feature. You apply it to whichever GA tags you need, but this will not work with non-GA tags unless the customTask feature is available to use. Note: customTask is a very new feature (Aug 2017) and I am not aware of other tags that can use this.

      That said, if you are collecting PII in plain text e.g. within URLs (regardless of where you are sending it), then you have a privacy issue as it will be logged by default on your webserver and on every router the URL passes through…

  7. Stu Bowker says:

    Thanks Brian. Might be worth adding ‘username’ too.

    • Yes exactly. I would customise the regex for your environment. Accountname, uname, user_name, customer etc., are all English possibilities. And if you work in multiple markets you will need to consider others as well e.g. kund(er) for the Swedish market. Essentially, a generic regex for all users doesn’t really work, it needs to be tailored…

  8. Jon Hibbitt says:

    Thanks! Awesome post and nice enhancements on Simo’s original. This will be going into the toolbox for sure.

  9. Yehoshua says:

    Hi Brian,

    I’m surprised that you included [redacted self-email] as an issue with regards to GA TOS. I would argue that the Terms of Service are concerned with personally identifying visitors to your site, and are not concerned with the personal details of the users of the website itself.

    As such, I find the customTask to be way too broad with regards to how it redacts information. My take is that doing things such as tracking who the user is trying to contact is not problematic. That applies to phone numbers tapped and mailto: links clicked. Additionally, while I would agree that a user’s address is out of bounds for data collection in GA, I don’t see how a postal code is a problem. Please argue the other side with me in the comments; I’m open to hearing it.

    Hope you’re doing well.

    Yehoshua

    • Hej @Yehoshua – although capturing your own org’s email addresses in GA reports is not strictly PII, it would likely be picked up by Google’s compliance robots. And who knows what that would result in – possibly data being deleted, your account suspended etc. Although you would be correct to argue the fine nuance of the reality, my suggestion is to avoid it in the first place i.e. good luck talking to a human compliance/legal officer at Google to make your case!

      In terms of zip codes, these can get very specific. For example, in the UK they can be limited to a set of 5 houses. Also, I found this article specific to Canada:

      “…it was found that 87.9% of the postal code locations were within 200 meters of the true address location (straight line distances) and 96.5% were within 500 meters of the address location (straight line distances).”

      https://ij-healthgeographics.biomedcentral.com/articles/10.1186/1476-072X-3-5

      • Stu Bowker says:

        I believe that capturing the first part of a UK postcode is OK because it’s at a city level, not street. For example, if the postcode equals “BA1 1AA”, then collecting just “BA1” in GA is fine.

        • Most likely true for the UK (always check such things with your legal/compliance team) and the regex can be customised accordingly. I would suggest if the post code is present, then the street address should also be checked and redacted…

      • Milos says:

        Precisely. As one GA expert once told me, don’t f*k with it 🙂
        If there is a remote possibility that GA will interpret the data as PII, don’t import it. Great post.

© Brian Clifton 2018