Noise or Music?

Remove PII from Google Analytics – The Smart Way

Categories: GA & GTM, GDPR & Privacy / Comments: 50

This is my extension to the GTM Tips post by the excellent Simo Ahava (his post: Remove PII From Google Analytics Hits).

Essentially, I had been looking for a way to block Personally Identifiable Information (PII) hits at the collection level i.e. using GTM, before the hit is sent to Google Analytics. Why do this? Putting the obvious requirement to not gather PII to one side, if you are adding filters to your GA Views in order to delete PII it is too late – the problem has already occurred. That is, if you have already sent the personal data to Google, then you have already broken GDPR compliance!

Previously, by using GTM I would simply drop any hits containing page URLs with an @ symbol i.e. in case the URL contained an email address. Apart from being quite blunt (not all URLs with an @ symbol contain an email address), this approach would not tackle email addresses being present in other hit types e.g. events, e-commerce data etc. It also did not tackle other PII types – such as telephone numbers, zip codes, usernames etc. Hence, the much better approach of Simo’s method – using GTM’s new customTask feature – was very interesting to me!

In this post, I extend his method by building out the regex more – for a more sophisticated email detection, and to capture other PII types…

Redact, rather than remove PII

The important thing to remember here is that we are redacting the PII – not blocking or removing it. This is an important distinction. If PII is present, it is almost certain that the same PII is being logged elsewhere on your network – your web server logfile at the very least. Reporting this in your Google Analytics in redacted form means you have a monitoring system that flags the problem to your web dev/IT team, so they can fix it and keep on top of it. Essentially, to be compliant, PII issues need to be fixed at their source by your organisation. Alternatively, if you deleted the PII data from your reports or simply stopped collecting it in GA, you would metaphorically be sweeping the problem under the carpet.

Here is my adjusted code for your Custom JavaScript variable.

IMPORTANT: This is a straight replacement for Simo’s code. Replace domain\.com with the domain of your website (lines 6 and 10). More on what this is for later. As always, when working with code it’s up to you to test it and ensure it works correctly. No liability accepted!

function() {
  return function(model) {
    // Add the PII patterns into this array as objects
    var piiRegex = [{
      name: 'EMAIL',
      regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi,
      group: ''
    },{
      name: 'SELF-EMAIL',
      regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi,
      group: ''
    },{
      name: 'TEL',
      regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi,
      group: '$1'
    },{
      name: 'NAME',
      regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi,
      group: '$1'
    },{
      name: 'PASSWORD',
      regex: /((password=)|(passwd=)|(pass=))[^&\/\?]+/gi,
      group: '$1'
    },{
      name: 'ZIP',
      regex: /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi,
      group: '$1'
    }
    ];
    // Fetch reference to the original sendHitTask
    var originalSendTask = model.get('sendHitTask');
    var i, hitPayload, parts, val;
    // Overwrite sendHitTask with PII purger
    model.set('sendHitTask', function(sendModel) {
      hitPayload = sendModel.get('hitPayload').split('&');
      for (i = 0; i < hitPayload.length; i++) {
        parts = hitPayload[i].split('=');
        val = decodeURIComponent(unescape(parts[1]));
        piiRegex.forEach(function(pii) {
          val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
        });
        parts[1] = encodeURIComponent(val);
        hitPayload[i] = parts.join('=');
      }
      sendModel.set('hitPayload', hitPayload.join('&'), true);
      originalSendTask(sendModel);
    });
  };
}

UPDATE: If you are having issues with encoded characters displaying in your Google Analytics reports for URLs and/or page titles, try changing line 41 of the JavaScript code (the line beginning val = decodeURIComponent) to the one below:

val = decodeURIComponent(decodeURIComponent(parts[1]));
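Why the change can matter: a minimal sketch, assuming a hypothetical page title of "Müller" that arrives UTF-8 percent-encoded in the hit payload. With unescape(), each %XX sequence becomes its own character, so the two bytes of "ü" are split apart and re-encoded as garbage. The variable names below are illustrative only.

```javascript
// Hypothetical payload value: the title "Müller", percent-encoded as UTF-8
var encoded = encodeURIComponent('M\u00FCller'); // "M%C3%BCller"

// Original line: unescape() maps each %XX to a single character,
// so the two UTF-8 bytes of "ü" become two separate characters
var viaUnescape = decodeURIComponent(unescape(encoded));

// Updated line: decodeURIComponent() reads the byte sequence as UTF-8
var viaDoubleDecode = decodeURIComponent(decodeURIComponent(encoded));

// What gets written back into the hit after re-encoding
console.log(encodeURIComponent(viaUnescape));     // M%C3%83%C2%BCller (garbled)
console.log(encodeURIComponent(viaDoubleDecode)); // M%C3%BCller (unchanged)
```

One caveat to test for: decoding twice can throw a URIError when the decoded value contains a literal % that is not followed by two hex digits (for example a title such as "50% off"), so it is worth checking against your own URLs and titles.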

Once you add this variable to your Universal Analytics tags as the customTask field, any hits sent by these tags will be parsed by this variable, which replaces the instances of PII with the string [REDACTED pii_type].
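To see what the variable does to a hit without publishing anything, it can be run against a stand-in for the model object. This is only a sketch: createMockModel and piiCustomTask are hypothetical names, the mock implements just the get and set calls the variable uses, and the variable is condensed to a single pattern to keep the example short.

```javascript
// Condensed copy of the variable with one pattern, for illustration only
function piiCustomTask() {
  return function(model) {
    var piiRegex = [{
      name: 'NAME',
      regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi,
      group: '$1'
    }];
    var originalSendTask = model.get('sendHitTask');
    model.set('sendHitTask', function(sendModel) {
      var hitPayload = sendModel.get('hitPayload').split('&');
      for (var i = 0; i < hitPayload.length; i++) {
        var parts = hitPayload[i].split('=');
        var val = decodeURIComponent(unescape(parts[1]));
        piiRegex.forEach(function(pii) {
          val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
        });
        parts[1] = encodeURIComponent(val);
        hitPayload[i] = parts.join('=');
      }
      sendModel.set('hitPayload', hitPayload.join('&'), true);
      originalSendTask(sendModel);
    });
  };
}

// Hypothetical stand-in for the model: stores fields and records what is "sent"
function createMockModel(payload, sent) {
  var fields = {
    hitPayload: payload,
    sendHitTask: function(m) { sent.push(m.get('hitPayload')); }
  };
  return {
    get: function(name) { return fields[name]; },
    set: function(name, value) { fields[name] = value; }
  };
}

var sent = [];
var pagePath = '/test?firstName=brian&other=1';
var model = createMockModel('v=1&t=pageview&dp=' + encodeURIComponent(pagePath), sent);

piiCustomTask()(model);          // customTask runs first and wraps sendHitTask
model.get('sendHitTask')(model); // the wrapped task then redacts and "sends"

var sentPath = decodeURIComponent(sent[0].split('&')[2].split('=')[1]);
console.log(sentPath); // /test?firstName=[REDACTED NAME]&other=1
```

Note that the & and = characters inside the page path are percent-encoded within the payload, so splitting the payload on & separates the hit fields (v, t, dp) rather than the URL's own query parameters. Each pattern is therefore applied to the full decoded URL, anchors included.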

For example, a URL with path:

/test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello

would be replaced with:

/test?tel=[REDACTED TEL]&email=b[REDACTED EMAIL]om&other=bcli[REDACTED SELF-EMAIL]IN.com&firstName=[REDACTED NAME]&password=[REDACTED PASSWORD]
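The example can be reproduced outside GTM by applying the same patterns to the path directly. This is a sketch for experimentation only: redact is a hypothetical helper name and is not part of the variable, and domain\.com is still the placeholder domain.

```javascript
// Same patterns as the variable; domain\.com is the placeholder domain
var piiRegex = [
  { name: 'EMAIL', regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'SELF-EMAIL', regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'TEL', regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi, group: '$1' },
  { name: 'NAME', regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi, group: '$1' },
  { name: 'PASSWORD', regex: /((password=)|(passwd=)|(pass=))[^&\/\?]+/gi, group: '$1' },
  { name: 'ZIP', regex: /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi, group: '$1' }
];

// Hypothetical helper: applies every pattern in turn to one decoded value
function redact(value) {
  piiRegex.forEach(function(pii) {
    value = value.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
  });
  return value;
}

var examplePath = '/test?tel=+44012345678&email=brian@me.com' +
  '&other=bclifton@DOMAIN.com&firstName=brian&password=hello';
var redactedPath = redact(examplePath);
console.log(redactedPath);
```

The label inside each redaction comes from the name value of the pattern, so a telephone match is reported as [REDACTED TEL]. Because of the i flag, DOMAIN.com is treated the same as domain.com.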

The Regex Changes Explained

-Extending the Email regex

For the EMAIL check, I make two changes to Simo’s original regex:

regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi,

Firstly, this matches any four characters that are not a forward slash (/), followed by @. Then, so long as the @ is not followed by domain.com, it matches the next four characters that are not a forward slash.

So apart from looking for an email address, I am doing two extra things:

1. I exclude any “innocent” links that may be captured as outbound links containing an @. Common examples are Google Maps and Flickr links, which contain a forward slash – the [^\/] part. Example links:

  • www.google.com/maps/place/University+of+San+Francisco+-+Folger+Bldg,+101+Howard+St,+San+Francisco,+CA+94105/@37.7908871,-122.3925594,17z/data=!3m1!
  • www.flickr.com/photos/123456@N06/sets/721576344/

2. I exclude the domain of the website itself from this check using a negative look ahead – the (?!….) part. Remember to replace domain\.com with your own domain e.g. brianclifton\.com in my case. I match for this separately next.
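A quick way to confirm the forward-slash rule is to test the pattern against the two link types above plus a real email address. This is a small sketch: the third sample path is made up for illustration, and the g flag is dropped because test() on a global regex keeps state between calls.

```javascript
// EMAIL pattern from the post, without the g flag so test() is stateless
var emailRegex = /[^\/]{4}@(?!domain\.com)[^\/]{4}/i;

var samples = [
  'www.google.com/maps/place/University+of+San+Francisco+-+Folger+Bldg,+101+Howard+St,+San+Francisco,+CA+94105/@37.7908871,-122.3925594,17z/data=!3m1!',
  'www.flickr.com/photos/123456@N06/sets/721576344/',
  '/contact?email=brian@me.com' // hypothetical path containing a real address
];

var results = samples.map(function(sample) {
  return emailRegex.test(sample);
});

console.log(results); // [ false, false, true ]
```

The Google Maps link fails because a / sits within the four characters before the @, and the Flickr link fails because a / sits within the four characters after it.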

My suggestion for a separate regex is to catch and redact any payloads containing the SAME email domain as the site itself, with a different “name” value to the regular email redaction. That way such emails will be reported differently in Google Analytics, allowing the site owner to ignore these and monitor real PII infringements.

For example:

  • If a visitor comes to my site and I capture their email address, e.g. simo@hissite.com, the redaction message is [REDACTED EMAIL]
  • If a visitor clicks an outbound link to the site owner and I capture my own email address, e.g. mysite@brianclifton.com, the redaction message is [REDACTED SELF-EMAIL]

As the site owner, the first message is the one I should be paying attention to. The second message (not really PII as it belongs to the site owner) keeps me compliant with Google’s terms of service.

For the SELF-EMAIL check, the regex is almost identical:

regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi,

The difference now is that I do wish to include my own domain in the match and this is achieved via a positive look ahead – the (?=….) part.
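The two lookaheads make the patterns mutually exclusive, which a short sketch can show. Here brianclifton\.com stands in for domain\.com as suggested above, classify is a hypothetical helper name, and the mailto: values are made-up samples.

```javascript
// brianclifton\.com replaces the domain\.com placeholder (note the escaped dot)
var emailRegex = /[^\/]{4}@(?!brianclifton\.com)[^\/]{4}/i;
var selfEmailRegex = /[^\/]{4}@(?=brianclifton\.com)[^\/]{4}/i;

// Hypothetical helper: reports which of the two patterns matches a value
function classify(value) {
  if (emailRegex.test(value)) { return 'EMAIL'; }
  if (selfEmailRegex.test(value)) { return 'SELF-EMAIL'; }
  return null;
}

console.log(classify('mailto:simo@hissite.com'));        // EMAIL
console.log(classify('mailto:mysite@brianclifton.com')); // SELF-EMAIL
console.log(classify('/no-address-here'));               // null
```

The backslash before the dot matters: an unescaped dot would match any character, so the site's own domain should be written as brianclifton\.com rather than brianclifton.com inside the regex.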

-Extending the regex to capture other PII

The original post by Simo was a simple pattern match – easy to use and maintain when you know the structure of the match you are looking for e.g. an @ symbol to match email addresses, or a well structured set of characters and numbers for strings like personal ID and social security numbers. However, I want to extend this to match less structured PII, for example people’s names, addresses, telephone numbers, zip codes etc.

To do this, we need a regex anchor. That is, a common string likely to contain such PII. I am assuming all such matches are contained within URL strings as query parameters (though name=value pairs in the URL path are also matched) e.g.

/test?tel=+46(0)12398765&firstname=Brian&zip=abc123

The anchor is the query name and we match for common PII culprits – these are tel, firstname and zip in my example. Of course these should be adjusted for your particular language. Anchors are the reason why the group key is required:

name: 'ZIP',
regex: /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi,
group: '$1'

In this case, $1 is the value of the string (our anchor) just before and including the = sign. We keep this in place for the data hit, and redact what follows. Without applying the grouping, the entire name=value pair would be redacted making troubleshooting difficult. I use [^&\/\?] in order to conclude the match within paths, or query parameters…
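The effect of the group key can be seen by running the ZIP pattern with and without the $1 prefix. This is a minimal sketch using the example path from above; the second path, with the pair inside the URL path, is a made-up sample.

```javascript
var zipRegex = /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi;
var queryPath = '/test?tel=+46(0)12398765&firstname=Brian&zip=abc123';

// With the group: the anchor (zip=) is kept and only the value is redacted
var withGroup = queryPath.replace(zipRegex, '$1' + '[REDACTED ZIP]');

// Without the group: the whole name=value pair disappears
var withoutGroup = queryPath.replace(zipRegex, '[REDACTED ZIP]');

// The match also ends at a forward slash, so pairs inside the path work too
var inPath = '/account/zip=abc123/orders'.replace(zipRegex, '$1' + '[REDACTED ZIP]');

console.log(withGroup);    // /test?tel=+46(0)12398765&firstname=Brian&zip=[REDACTED ZIP]
console.log(withoutGroup); // /test?tel=+46(0)12398765&firstname=Brian&[REDACTED ZIP]
console.log(inPath);       // /account/zip=[REDACTED ZIP]/orders
```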

Happy compliance testing 🙂

BTW, you do know I am building a data auditing and compliance tool to measure and monitor Google Analytics data quality, right?

[Thank you to the equally excellent David Vallejo for his JavaScript help – my skills are simply too rusty!]

Comments (most recent first)

  1. Christina says:

    Hi Brian,

    I really love your modified version of Simo’s script.
    One question – Since you are specifying the website domain. How would this work on a rollup account that holds data for multiple domains?

    Thanks much!
    Christina

  2. Bryan says:

    Hi Brian,

    I am currently using the “Exclude URL Query Parameters” option in my GA view settings to remove PII from GA (not a filter). Will this prevent PII from getting recorded on GA’s servers?

    • That will work, BUT that removes the important signal that something is wrong i.e. the PII will be logged elsewhere – such as web server log files and routers around the internet. Better to redact PII by default – see this post: https://brianclifton.com/blog/2017/09/07/remove-pii-from-google-analytics/, setup a GA alert for when this appears in your reports, then fix the issue at source.

      That would be GDPR compliant i.e. you have a process to monitor and fix PII issues. Your quick fix is not…

      • Dear Brian,

        I believe you are mistaken. No view settings (not a filter, not the exclusion of query parameters, not any view setting) will effectively prevent PII from being recorded in GA, since a view setting only alters the way you SEE your data. The captured data, including PII, will still be stored on a property level, which is still prohibited by the terms of GA.

        So Bryan, applying any sort of view setting to hide/filter/exclude your PII won’t prevent the recording of PII on the Google servers. The only way to prevent PII being recorded in GA is preventing that it even enters GA by either using cool scripts (in GTM) like Brian’s or Simo’s, or just fixing your website.

        Hope it helps.

      • Bryan says:

        Thank you for the reply, Brian! I will implement your custom JavaScript in GTM to prevent PII from getting on GAs servers.

        Is there something I can do to remove the PII that’s been getting pulled into GA, or is it too late?

      • Bryan says:

        Hi Brian,

        Should customTask popup in the Field Name dropdown of the “More Settings” in my Universal Analytics tag? I’m not seeing it. Does customTask need to be added to my GTM container? I’m a little confused here.

        Would you be able to point me in the right direction?

  3. Lifan Shiu says:

    Hi,

    Thanks for the post. It works, but I want to extend this and want to use this to redact IBAN (banknumber within Europe). An IBAN number consist of:
    – Countrycode (2 letters) for example: NL
    – 2 check numbers for example 53
    – Bank code (4 characters) for example: ABNA
    – Bankaccountnumber (10 numbers): 1234567890

    The whole IBAN number would be: NL53ABNA1234567890

    I tried the following regex: /^[a-zA-Z]{2}[0-9]{2}[a-zA-Z0-9]{4}[0-9]{7}([a-zA-Z0-9]?){0,16}$/gi;

    but that didn’t work. I don’t know a lot of regex so if someone can help me out, that would be great.

    Thanks in advance.

    • Looks like you are missing a set of brackets i.e. to wrap the preceding token before the quantifier {0,16}

      So: ^(everything in here){0,16}$/gi

      Remember this only works in GA. So if you have an issue collecting this type of data, better to fix the underlying problem as this is also likely to be in your server/router logfiles.

      HTH

  4. Jeroen says:

    Great article so many thanks for sharing.

    Can you give any advice on how to test this in preview before I publish?

    Is this even possible?

    Thanks in advance

  5. John says:

    When you say replace “domain\.com” with your domain, does that mean “example.com” becomes “example\.com” ?

  6. Bob says:

    Hi Brian,

    Is there a step by step video or article to walk me through this? I’m new to GTM and not overly familiar with the process of setting up triggers with variables.

    Thanks,

  7. Peter says:

    Great article so many thanks for sharing.
    Got a quick question on the script itself. Was wondering, if there is a way to get the full functionality scope of anonymizing PII, but avoid special characters in title tags etc. to be encoded in the reports. (characters like ü, ä, etc.)

    Any advice on that would be much appreciated
    thanks

    • Hello Peter – are you seeing an issue with these chars? I live in Sweden and have used this method with åäö chars without problems…

      • Peter says:

        Hi Brian,

        Yes, I do indeed. If I roll back the customTask implementation in GTM everything goes back to “normal”. That’s why I believe its related to the PII script in your post. Results are like: https://prnt.sc/joyft8
        So the main characters affected are the German ü, Ü, ä, Ä, ö, Ö, but also symbols like “»”. The on-site meta tags are decoded correctly; only in the reports do they appear encoded for some reason.

        • Did you use the “Update” notice below the code i.e. changing line 41?

          I am using that change (some servers handle such chars differently) and it works for me.

          • Peter says:

            UPDATE: Apologies, Brian! Your line 41 update actually solved it. My mistake. Sorry again and thanks for helping out

  8. nicolaos says:

    Thanks for this Awesome post ! Quick question: Why not treat EMAIL query as all others like firstname and telephone.
    In other words, why not remove the whole value of the email query parameter instead of just the 4 characters before and after the @?

    Thanks

    • Thanks for the feedback. The simple answer is because you can i.e. there is a nice anchor (the @) to use that other fields don’t have. That way, the owner of the data is able to understand exactly what email addresses are being captured. For example, it could be legitimate mailto links to resellers i.e. not really PII. There is just no way to do this with other potential PII.

      • nicolaos says:

        Superb! Thanks.

        Any ideas how to implement this for Adwords too?

        Unfortunately couldn’t add a customTask for Adwords the same way I did for GA.

        Thanks in advance

        • No, this is specific to GA at present…

          • Hi nicolaos and Brian,

            What we’ve done in our case to prevent PII to be sent to AdWords is to simply create a GTM variable which compares the audited URL and compares it with the original page URL. If both of them are the same, nothing was audited (URL is OK), if they are not the same, likely PII was detected.

            Using the output from this variable either in your trigger for firing the AdWords tag(s), or by embedding this variable in an exception trigger, you can control whether the AdWords tag is fired or not; preventing PII to be sent to AdWords.

            Yes, you may miss some remarketing / conversion information, but at least PII will no longer be sent to AdWords (assuming that you’ve tackled all forms of PII possible on your webpage).
            Using Google Analytics you can find out what the problematic page is and make sure it will be fixed by the web developer. Ideally there should never be any personal information present in the page URL.

  9. Ankit says:

    Great post.
    Just one simple question.
    Do you know how I can use customTask for ‘AdWords Remarketing’ tags (to redact PII)? I use it very well for ‘Universal Analytics’ tags.

  10. Marco says:

    Hi Brian,

    Thank you very much for posting this useful information.

    I am quite interested in implementing this mechanism on my website, which is coded in PHP instead of JavaScript. Do you know if this is a limitation for implementing GTM (I think the frame of code is only available in .js coding?) and consequently your custom JavaScript variable?

  11. Travis B says:

    Any way to go about doing this in GA without using Tag Manager?

    • You really want to do this at the point of data collection i.e. via GTM (or other tag manager solution) and not once the data is already in your GA account. Essentially, once you have collected the PII data you have already broken data privacy laws – regardless of where it is then stored/processed.

      However, note this technique is a monitoring system for you – using GA to redact and then flag up PII issues with your website. Ultimately these need to be flagged to the IT/Web Dev team to sort out the underlying issue, as even if the data does not get into GA it is very likely to be in web server log files, router log files, firewall log files etc.

      • Travis B says:

        Thanks Brian! I’m actually having to use analytics.js just for the sake of the way our e-commerce tracking is set up. I found out I can load analytics.js and then the ecommerce.js plugin without sending a page view. and then use GTM for page views and setting up this PII flagging. Great post! Thanks for the response.

  12. Matt says:

    Thanks for sharing! I really like the idea of capturing more types of PII. However, it seems like this code snippet only works for the EMAIL and SELF-EMAIL patterns. For the other types of PII, it seems like the regex pattern isn’t getting applied to the right piece.

    It looks like the line
    “`
    hitPayload = sendModel.get('hitPayload').split('&')
    “`

    breaks the URL into an array of strings, removing and splitting it at any `&`’s.

    So, a starting URL of:
    “`
    /test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello
    “`

    Would look like this as a JavaScript array:
    “`
    [
    “/test?tel=+44012345678”,
    “email=brian@me.com”,
    “other=bclifton@DOMAIN.com”,
    “firstName=brian”,
    “password=hello”
    ]
    “`

    You then cycle through this array with a for loop:
    “`
    for (i = 0; i < hitPayload.length; i++) {

    }
    “`

    Inside the for loop, you break each string into another array of strings by splitting on the `=`:
    “`
    parts = hitPayload[i].split('=');
    “`

    So a piece of the larger array like:
    “`
    "firstName=brian"
    “`

    Becomes:
    “`
    ["firstName", "brian"]
    “`

    You then decode and assign to `val` the second item in the array.
    “`
    val = decodeURIComponent(unescape(parts[1]));
    “`

    Here’s what looks like a problem to me: only "val" gets the regex pattern applied to it:
    “`
    piiRegex.forEach(function (pii) {
    val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
    });
    “`

    The code seems to be cycling through the PII regex patterns and applying both the pattern and replacement only to the values of any query parameters, but not the query. The anchor needs to be searching for terms like `firstName`, but I think it’s only seeing values like "brian" all the way down.

    You then encode the value again, join each part back by the =, and join everything back up into one url with the &.
    “`
    parts[1] = encodeURIComponent(val);
    hitPayload[i] = parts.join('=');
    } // end for loop
    sendModel.set('hitPayload', hitPayload.join('&'), true);
    “`

    Am I missing something? It seems like the regex pattern is only getting applied to the second part of `parts` or the value in the query parameter (e.g. “brian"), but never the full query parameter (e.g. “firstName=brian")

  13. Simms says:

    Great post but one small question.

    Is it possible to just make this apply to GA and not our own software system. Is this code a blanket block for all tags or could some tags be on. White list?

    • Hello @simms – this method uses a Custom JS variable that is applied via Google Analytics’s new customTask feature. You apply it to which ever GA tags you need, but this will not work with non-GA tags unless the customTask feature is available to use. Note: customTask is a very new feature (Aug 2017) and I am not aware of other tags that can use this.

      That said, if you are collecting PII in plain text e.g. within URLs (regardless of where you are sending it), then you have a privacy issue as it will be logged by default on your webserver and on every router the URL passes through…

  14. Stu Bowker says:

    Thanks Brian. Might be worth adding ‘username’ too.

    • Yes exactly. I would customise the regex for your environment. Accountname, uname, user_name, customer etc., are all English possibilities. And if you work in multiple markets you will need to consider others as well e.g. kund(er) for the Swedish market. Essentially, a generic regex for all users doesn’t really work, it needs to be tailored…

  15. Jon Hibbitt says:

    Thanks! Awesome post and nice enhancements on Simo’s original. This will be going into the toolbox for sure.

  16. Hi Brian,

    I’m surprised that you included [redacted self-email] as an issue with regards to GA TOS. I would argue that the Terms of Service are concerned with personally identifying visitors to your site, and are not concerned with the personal details of the users of the website itself.

    As such, I find the customTask to be way too broad with regards to how it redacts information. My take is that doing things such as tracking who the user is trying to contact is not problematic. That applies to phone numbers tapped and mailto: links clicked. Additionally, while I would agree that a user’s address is out of bounds for data collection in GA, I don’t see how a postal code is a problem. Please argue the other side with me in the comments; I’m open to hearing it.

    Hope you’re doing well.

    Yehoshua

    • Hej @Yehoshua – although capturing your own orgs’ email addresses with GA reports is not strictly PII, it would likely be picked up by Google’s compliance robots. And who knows what that would result in – possibly data being deleted, your account suspended etc. Although you would be correct to argue the fine nuance of the reality, my suggestion is to avoid it in the first place i.e. good luck talking to a human compliance/legal officer at Google to make your case!

      In terms of zip codes, these can get very specific. For example, in the UK they can be limited to a set of 5 houses. Also, I found this article specific to Canada:

      “…it was found that 87.9% of the postal code locations were within 200 meters of the true address location (straight line distances) and 96.5% were within 500 meters of the address location (straight line distances).”

      https://ij-healthgeographics.biomedcentral.com/articles/10.1186/1476-072X-3-5

      • Stu Bowker says:

        I believe that capturing the first part of a UK postcode is OK because it’s at a city level, not street. For example, if the postcode equals “BA1 1AA”, then collecting just “BA1” in GA is fine.

        • Most likely true for the UK (always check such things with your legal/compliance team) and the regex can be customised accordingly. I would suggest if the post code is present, then the street address should also be checked and redacted…

      • Milos says:

        Precisely. As one GA expert once told me, don’t f*k with it 🙂
        If there is a remote possibility that GA will interpret the data as PII, don’t import it. Great post.

© Brian Clifton 2018