How to Remove PII from Google Analytics – the smart way!

This is my PII extension to the initial post by the excellent Simo Ahava (his post: Remove PII From Google Analytics Hits).

Essentially, I had been looking for a way to block Personally Identifiable Information (PII) hits at the collection level i.e. using GTM, before the hit is sent to Google Analytics.

Why do this?

Putting the obvious requirement to not gather personal data to one side, if you are adding filters to your analytics views to delete PII, it is simply too late – the problem has already occurred and GDPR compliance has been broken! See my related post on why filters are not sufficient.

Previously, by using GTM I would simply drop any hits containing page URLs with an @ symbol i.e. in case the URL contained an email address. Apart from being quite blunt (not all URLs with an @ symbol contain an email address), this approach would not tackle email addresses being present in other hit types e.g. events, e-commerce data etc. It also did not tackle other PII types – such as telephone numbers, zip codes, usernames etc. Hence, the much better approach of Simo’s method – using the new customTask API of Universal Analytics – was very interesting to me!

In this post, I extend his method by building out the regex more – for a more sophisticated email detection, and to capture other PII types…

Redact, rather than remove PII

The important thing here is to remember we are redacting the PII – not blocking or removing it. This is an important distinction. If PII is present, it is almost certain that the same PII is being logged elsewhere on your network – your web server logfile at the very least. Reporting this in your Google Analytics in redacted form means you have a monitoring system to flag the issue to your web dev/IT team, so they can fix it and keep on top of it. Essentially, to be compliant, PII issues need to be fixed at their source by your organisation. Alternatively, if you deleted the PII data from your reports, or simply stopped collecting it in GA, you would metaphorically be sweeping the problem under the carpet.

Here is my adjusted code for your Custom JavaScript variable.

IMPORTANT: This is a straight replacement to Simo’s code. Replace example\.com with the domain of your website (lines 7 and 11). More on what this is for later. Thank you to the excellent David Vallejo for his JavaScript help – my skills are simply too rusty nowadays! As always, when working with code it’s up to you to test it and ensure it works correctly. No liability accepted!

UPDATE: This code was rewritten 29-Aug-2018 for better handling of the GA hit. In particular, it now works with GTM’s native YouTube trigger. Simply swap out the original code for this new one.

function() {
  return function(model) {
    try{
      // Add the PII patterns into this array as objects
      var piiRegex = [{
        name: 'EMAIL',
        regex: /[^\/]{4}(@|%40)(?!example\.com)[^\/]{4}/gi,
        group: ''
      },{
        name: 'SELF-EMAIL',
        regex: /[^\/]{4}(@|%40)(?=example\.com)[^\/]{4}/gi,
        group: ''
      },{
        name: 'TEL',
        regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi,
        group: '$1'
      },{
        name: 'NAME',
        regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi,
        group: '$1'     
      },{
        name: 'PASSWORD',
        regex: /((password=)|(passwd=)|(pass=))[^&\/\?]+/gi,
        group: '$1'
      },{
        name: 'ZIP',
        regex: /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi,
        group: '$1'
      }

    ];		    
      // Fetch reference to the original sendHitTask
      var originalSendTask = model.get('sendHitTask');
      var i, hitPayload, data, val;


      model.set('sendHitTask', function(sendModel) {
          hitPayload = model.get('hitPayload');	
          //  Let's convert the current querystring into a key,value object
          data = (hitPayload).replace(/(^\?)/,'').split("&").map(function(n){return n = n.split("="),this[n[0]] = n[1],this}.bind({}))[0];
          //  We'll be looping through all keys and values now
          for(var key in data){

              // Let's have the value decoded before matching it against our array of regexes
              piiRegex.forEach(function(pii) {	
                var val = decodeURIComponent(data[key]);			        	
                // The value is matching?
                if(val.match(pii.regex)){
                  // Let's replace the key value based on the regex and let's reencode the value
                  data[key] = encodeURIComponent(val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']'));			          
                }                        
              });  
            			    
          }        
          // Going back to roots, convert our data object into a querystring again =)    
          sendModel.set('hitPayload', Object.keys(data).map(function(key) { return (key) + '=' + (data[key]); }).join('&'), true);
          // Send the hit with the redacted payload
          originalSendTask(sendModel);
      });    
    }catch(e){}
  };
}

Edit Your Tags

In order to function as intended, the customTask field needs to be added to ALL Google Analytics tags. That of course is cumbersome and does not scale with the volume of tags used. Therefore it is much better to apply this as a one-time fix in a Google Analytics settings variable. You can read more about the power of the Universal Analytics settings variable approach from Simo.

Now any hits sent by these tags will be parsed by this variable, which replaces the instances of PII with the string [REDACTED pii_type]. For example, a URL with path:

/test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello

would be replaced with:

/test?tel=[REDACTED TEL]&email=b[REDACTED EMAIL]om&other=bcli[REDACTED SELF-EMAIL]IN.com&firstName=[REDACTED NAME]&password=[REDACTED PASSWORD]
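
To try the redaction outside GTM, the following is a minimal sketch that applies the same six patterns to a plain string and to a hit payload. The helper names redactValue and redactPayload are hypothetical (they are not part of the variable above), and domain\.com stands in for your own domain.

```javascript
// Same six patterns as the Custom JavaScript variable.
// Assumption: domain\.com stands in for the site's own domain.
var piiRegex = [
  { name: 'EMAIL',      regex: /[^\/]{4}(@|%40)(?!domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'SELF-EMAIL', regex: /[^\/]{4}(@|%40)(?=domain\.com)[^\/]{4}/gi, group: '' },
  { name: 'TEL',        regex: /((tel=)|(telephone=)|(phone=)|(mobile=)|(mob=))[\d\+\s][^&\/\?]+/gi, group: '$1' },
  { name: 'NAME',       regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi, group: '$1' },
  { name: 'PASSWORD',   regex: /((password=)|(passwd=)|(pass=))[^&\/\?]+/gi, group: '$1' },
  { name: 'ZIP',        regex: /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi, group: '$1' }
];

// Hypothetical helper: run every pattern, in order, over one decoded value.
function redactValue(value) {
  return piiRegex.reduce(function (current, pii) {
    return current.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
  }, value);
}

// Hypothetical helper: decode, redact and re-encode each field of a hit payload.
// Fields that cannot be decoded, or that contain no PII, are passed through untouched.
function redactPayload(payload) {
  return payload.split('&').map(function (pair) {
    var idx = pair.indexOf('=');
    if (idx === -1) { return pair; }
    var key = pair.slice(0, idx);
    var decoded;
    try {
      decoded = decodeURIComponent(pair.slice(idx + 1));
    } catch (e) {
      return pair; // malformed percent-encoding: leave the field as it was
    }
    var redacted = redactValue(decoded);
    return redacted === decoded ? pair : key + '=' + encodeURIComponent(redacted);
  }).join('&');
}

var path = '/test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello';
var redactedPath = redactValue(path);
console.log(redactedPath);

// An event hit: the email sits in the event action (ea), not in the page path.
var redactedHit = redactPayload('v=1&t=event&ea=signup%20brian%40me.com&dp=%2Fthanks%3Fzip%3Dabc123');
console.log(decodeURIComponent(redactedHit));
```

Because every field of the payload is checked, the same patterns cover page paths, event fields, custom dimensions and so on.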

The Regex Changes Explained

-Extending the Email regex

For the EMAIL check, I make two changes to Simo’s original regex:

regex: /[^\/]{4}@(?!domain\.com)[^\/]{4}/gi,

Firstly, this matches four consecutive characters that are not a forward slash /, followed by @. Then, so long as the @ is not followed by domain.com, it matches the next four characters that are not a forward slash.

So apart from looking for an email address, I am doing two extra things:

1. I exclude any “innocent” links that may be captured as outbound links containing an @. Common examples are Google Maps and Flickr links, which contain a forward slash – the [^\/] part. Example links:

  • www.google.com/maps/place/University+of+San+Francisco+-+Folger+Bldg,+101+Howard+St,+San+Francisco,+CA+94105/@37.7908871,-122.3925594,17z/data=!3m1!
  • www.flickr.com/photos/123456@N06/sets/721576344/

2. I exclude the domain of the website itself from this check using a negative look ahead – the (?!….) part. Remember to replace domain\.com (written as example\.com in the code above) with your own domain e.g. brianclifton\.com in my case. I match for this separately next.
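
Both exclusions can be checked quickly in a browser console or Node. This is only a sketch: the two email test strings are made up for illustration, domain\.com is a placeholder, and the g flag is left off because a global regex keeps state between test() calls.

```javascript
// EMAIL pattern from this section, without the g flag so test() is stateless.
var emailRegex = /[^\/]{4}@(?!domain\.com)[^\/]{4}/i;

var samples = {
  // A slash sits within four characters of the @, so these are left alone.
  maps: 'www.google.com/maps/place/University+of+San+Francisco+-+Folger+Bldg,+101+Howard+St,+San+Francisco,+CA+94105/@37.7908871,-122.3925594,17z/data=!3m1!',
  flickr: 'www.flickr.com/photos/123456@N06/sets/721576344/',
  // Illustrative URLs (assumptions, not taken from a real site).
  visitorEmail: '/contact?email=simo@hissite.com',
  ownDomainEmail: '/contact?email=mysite@domain.com'
};

var results = {};
Object.keys(samples).forEach(function (key) {
  results[key] = emailRegex.test(samples[key]);
});

console.log(results);
```

Only visitorEmail should come back true: the map and Flickr links fail the "no forward slash" requirement, and the own-domain address is rejected by the negative look ahead.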

My suggestion for a separate regex is to catch and redact any payloads containing the SAME email domain as the site itself, with a different “name” value to the regular email redaction. That way such emails will be reported differently in Google Analytics, allowing the site owner to ignore these and monitor real PII infringements.

For example:

  • If a visitor comes to my site and I capture their email address e.g. simo@hissite.com, the redaction message is [REDACTED EMAIL]
  • If a visitor comes to my site and I capture my own email address, for example as an outbound click-through to the site owner e.g. mysite@brianclifton.com, the redaction message is [REDACTED SELF-EMAIL]

As the site owner, the first message is the one I should be paying attention to. The second message (not really PII as it belongs to the site owner) keeps me compliant with Google’s terms of service.

For the SELF-EMAIL check, the regex is almost identical:

regex: /[^\/]{4}@(?=domain\.com)[^\/]{4}/gi,

The difference now is that I do wish to include my own domain in the match and this is achieved via a positive look ahead – the (?=….) part.
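
As a rough illustration of how the two look aheads split the same kind of match into two labels, here is a small sketch using brianclifton\.com as the site domain. The helper name redactEmails and the mailto: strings are assumptions made for the example.

```javascript
// The two email checks side by side; only the look ahead differs.
var emailChecks = [
  { name: 'EMAIL',      regex: /[^\/]{4}@(?!brianclifton\.com)[^\/]{4}/gi }, // any other domain
  { name: 'SELF-EMAIL', regex: /[^\/]{4}@(?=brianclifton\.com)[^\/]{4}/gi }  // the site's own domain
];

// Hypothetical helper: apply both checks to one string.
function redactEmails(value) {
  return emailChecks.reduce(function (current, check) {
    return current.replace(check.regex, '[REDACTED ' + check.name + ']');
  }, value);
}

var visitor = redactEmails('mailto:simo@hissite.com');
var owner = redactEmails('mailto:mysite@brianclifton.com');

console.log(visitor); // mailto:[REDACTED EMAIL]ite.com
console.log(owner);   // mailto:my[REDACTED SELF-EMAIL]nclifton.com
```

A look ahead consumes no characters, so in both cases the four characters after the @ are still swallowed by the final [^\/]{4}.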

-Extending the regex to capture other PII

The original post by Simo was a simple pattern match – easy to use and maintain when you know the structure of the match you are looking for e.g. an @ symbol to match email addresses, or a well structured set of characters and numbers for strings like personal ID and social security numbers. However, I want to extend this to match less structured PII, for example people’s names, addresses, telephone numbers, zip codes etc.

To do this, we need a regex anchor. That is, a common string likely to contain such PII. I am assuming all such matches are contained within URL strings as query parameters (though name=value pairs in the URL path are also matched) e.g.

/test?tel=+46(0)12398765&firstname=Brian&zip=abc123

The anchor is the query name and we match for common PII culprits – these are tel, firstname and zip in my example. Of course these should be adjusted for your particular language. Anchors are the reason why the group key is required:

name: 'ZIP',
regex: /((postcode=)|(zipcode=)|(zip=))[^\/\?&]+/gi,
group: '$1'

In this case, $1 is the value of the string (our anchor) just before and including the = sign. We keep this in place for the data hit, and redact what follows. Without applying the grouping, the entire name=value pair would be redacted, making troubleshooting difficult. I use [^&\/\?] so that the match ends at the next &, / or ? – that way it works within paths as well as query parameters…
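
The effect of the group key is easy to see by running the ZIP pattern with and without $1 in the replacement. This is a minimal sketch; the /account/... path is a made-up example of a name=value pair sitting in the URL path.

```javascript
var zipRegex = /((postcode=)|(zipcode=)|(zip=))[^&\/\?]+/gi;
var url = '/test?tel=+46(0)12398765&firstname=Brian&zip=abc123';

// With the anchor kept ($1), the report still shows which parameter leaked.
var withGroup = url.replace(zipRegex, '$1' + '[REDACTED ZIP]');

// Without it, the parameter name is lost along with the value.
var withoutGroup = url.replace(zipRegex, '[REDACTED ZIP]');

// The character class stops at the next /, so a pair inside a path is also handled.
var inPath = '/account/zip=abc123/edit'.replace(zipRegex, '$1' + '[REDACTED ZIP]');

console.log(withGroup);    // /test?tel=+46(0)12398765&firstname=Brian&zip=[REDACTED ZIP]
console.log(withoutGroup); // /test?tel=+46(0)12398765&firstname=Brian&[REDACTED ZIP]
console.log(inPath);       // /account/zip=[REDACTED ZIP]/edit
```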

Happy compliance testing 🙂

BTW, you do know I am building a data auditing and compliance tool to measure and monitor Google Analytics data quality, right?

Looking for a keynote speaker, or wish to hire Brian…?

If you are an organisation wishing to hire me and my team, please view the Contact page. I am based in Sweden and advise organisations in Europe as well as North America.

91 Comments

  1. James

    Hi Brian,

    This is great and worked a treat in UA. How are you approaching this in GA4?

    Thanks

    • Brian Clifton

      Hello James – customTask is specific to GTM and Universal Analytics. There is no equivalent for GA4. However take a look at verified-data.com

  2. Kathy

    Hi There
    I’ve implemented this script and it seems to work very well, however…. I’m running into an issue with double counting of page views – one with “REDACTED” and one with the PII. The situation is that a user fills out a booking form on our site – it’s then handled by a third party – and then the user is sent back to the main site to a thank you page, which is appending the PII in the URL – i.e. /booking-confirmation/?sessionID=4444&fname=John&lname=Smith. Does anyone by any chance have a solution or GTM tweak for this? Is it possible to have the third party maybe strip out the PII but keep the sessionID before sending it over to GA? Thank you!

  3. AB

    Can I use this technique to remove parts of the URL (not just query strings or fields that are dynamically added)? So for example, if my page was: http://www.mysite.com/widgets-for-sale-low-prices could I strip out part of the URL such as “low-prices” from going to GA?

    • Brian Clifton

      I guess you could modify the script in that way, but that is not its purpose. I think you should investigate using filters in GA for this.

  4. Ankit Garg

    Hi,

    How can I redact PII in the Event Action? Is that possible with the same code?

    Regards,
    Ankit

    • Brian Clifton

      Yes, the redaction method is based on Simo’s original post about customTask. It redacts the entire GA hit. Please ensure you understand how customTask works before deploying this enhancement.

  5. Hamza

    This code works for me on one link and not on the other for some reason.

    It works well on this
    /website/website-search-result/home?location=[hidden location]&distance=10&provider=&coursehours=all&coursetype=all&startdate=anytime&startdateday=22&startdatemonth=5&startdateyear=2020

    it does not work on the one below
    /website/course-details?courseid=987777777ghhda7a22169233&r=b0f44b90-ca8d-4e59-af92-b28ee7078547&referralpath=/website/website-search-result/?searchterm=law&location=london&distance=10&provider=&coursehours=all&coursetype=all&startdate=anytime&startdateday=23&startdatemonth=5&startdateyear=2020

    Any help would be appreciated. thanks

    • Brian Clifton

      Which query parameter are you trying to redact? You will need to adjust/append the regex array (piiRegex) accordingly.

  6. Nabeel

    Hi Brian,

    I hope you are well and thank you for the great article.

    Which regex can I use to remove a name and surname from the following URL:
    https://example.com/user/john.smith/ ?

    I have tried using this regex from your article “regex: /((firstname=)|(lastname=)|(surname=))[^&\/\?]+/gi,
    group: ‘$1’ “, however, that regex only works if the url has, for example, “/firstname=john” in the URL.

    Thank you.

  7. David

    I just wanted to check if this will automatically redact PII that has been entered in a site search if Site Search tracking has been enabled?

    • Brian Clifton

      Hello David – if the capture of site search is in the “GA standard way” i.e. a URL query parameter, then yes this technique will work. However only for email addresses.

  8. mimi

    Hello Brian,

    You said that it’s not possible to remove historical PII data, but do you think we can remove them from the user explorer report?
    From that report, we can easily isolate user ID that contains PII data and then, remove them. Any thoughts about it?

    • Brian Clifton

      Hello Mimi – yes, in the User Explorer report there is a new “Delete User” feature (from Oct 2019 from memory). Though of course you need to know the cookie ID of the user in the first place. This is useful for right to be forgotten requests.

  9. Thom Mikhout

    Hi Brian,

    First, thank you for the comprehensive blogpost about removing PII from Google Analytics.

    After implementing this PII solution, I noticed that RegEx for email only redacts the email partially. Now, I would like to fully redact the email address. Is this possible (with a different version of the RegEx)? Do you know how I can do this?

    Thank you in advance and awaiting your reply.

    Thom

  10. Tina

    Is it possible to delete historical PII?

    • Brian Clifton

      Hello Tina – there is no way to delete any historical data in Google Analytics.

      However, if Google picks it up themselves then expect them to delete ALL data for the affected date range. That is, all data even if not the offending PII data. It’s very blunt… so prevention is the key. Make sure you check out verified-data.com (my day job!)

  11. Tim Borden

    Just an FYI. There’s a small bug that means only the last regex replacement will be applied if there is more than one match.

    It can be fixed by moving:
    var val = decodeURIComponent(data[key]);

    …inside the loop:
    piiRegex.forEach(function(pii) {

    For example:
    piiRegex.forEach(function(pii) {
    var val = decodeURIComponent(data[key]);

    Hope that helps.

    • Brian Clifton

      Yes, see also Billy H’s reply just below.

      I have not had a chance to sit down and work through this, so thanks for the reminder.

      • Brian Clifton

        OK, code updated now. Thanks again for the heads up and taking the trouble to post the fix 🙂

        • Tim Borden

          Cheers! ….thanks for posting the article.

  12. Nathan

    Hi,

    Am currently trying to achieve something similar but more comprehensive. Originally I was trying to come up with a way to change the “Exclude Query Parameters” function to “Include Query Parameters”.

    For this, I arrived at a filter set.

    1. Change the query prefix of things you want e.g. ?q= or &q= to some different symbol (e.g. ~~~)
    2. Apply a blanket filter for any item ? or & to be removed completely (replace with blank)
    3. Change anything with ~~~ back to a query string form

    Really annoying, but theoretically sound. I was about to test this though, when I realised there will be a difference between human or tech error (e.g. when your email platform accidentally sends a name etc) and malicious injection (e.g. I type exampledomain.com/personsname/email/phone) which results in data being collected that is personal.

    Wondering if there is a combination of the two needed. Or do we all agree that we’ll never be able to protect against external bad actors?

    Nathan

    • Brian Clifton

      The thing is… Anything done within GA is too late. The PII has already been collected and so adding filters post collection is only really hiding the issue.

      • Nathan

        Ok, but if you miss a piece of PII through this method, since you’re essentially manually declaring it then you are also in contravention of PII. Further, under GDPR it goes beyond PII to personal data, so IPs, pseudonymous identifiers (e.g. within email etc).

        Not criticising either approach, just wondering how we work to a solution. Seems like something Google needs to get involved in.

        • Brian Clifton

          Essentially, there is no one single “tip” to protect all possible PII risks. However, have a look at verified-data.com as a way of putting in place a process for GDPR compliance with Google Analytics… 😉

  13. Billy H

    Hi Brian,

    Thanks for this in depth explanation, it was exactly what I needed. However, I ran into an issue with the provided javascript. I’m unsure if my issue was isolated to my site or if it has something to do with a recent browser standard change or similar issue (I’m using Chrome), but I was able to find a solution and thought I’d share it here in case anyone else runs into something similar.

    The code, as provided, was only redacting the last entry of the PII Regex list, so only the ZIP was being redacted. After spending a bit of time in GTM’s preview mode and a lot of uses of console.log(), I realized that the page URI’s encoding within the ‘hitpayload’ was preventing it from being split using “&” as the delimiter (line 40). Because of that, the entire URI was being tested as a single Key Value, and line 50 was editing the original Key Value with every iteration, overwriting previous changes.

    The simplest solution I found was to add “val = decodeURIComponent(data[key]);” after line 46, so that the loop would be working with the updated value after each iteration, thereby maintaining previous changes.

    Still not sure if this was an edge case situation, but perhaps I’ll save someone some troubleshooting.

    Thanks!

    • Sunny

      That was a good catch Billy H

      Adding after Line number 46
      val = decodeURIComponent(data[key]);

      • Brian Clifton

        OK, code updated now. Thanks again for the heads up and taking the trouble to post the fix 🙂

    • Chris Justin

      Had the same problem. Thank you for the fix!

  14. John Tranberg

    Nice article.

    I have 1 question. Lets say my regex is

    regex: /((id=)|(cusid=))[^&\/\?]+/gi,

    This matches:
    all that ends with id= like:
    testid=
    tokenid=

    I want it to only match id= and cusid=
    What do i need to change to get that to work?

    • Brian Clifton

      Your first check is causing you this issue: id=

      Let us assume you are referring to query params to match against, in which case you can check as your first match: (&|\?)id=

      HOWEVER, remember the customTask is matching against the entire GA data hit. Therefore be careful in what you match for. For example, id= would match UA-ID= and that could prevent your data even reaching your GA account…

  15. Stu Bowker

    Hi Brian, love this but I’ve found the PII custom variable prevents some tags from being sent to GA despite being triggered correctly. The GA code would get so far, then in the console we’d see ‘[Violation] ‘setTimeout’ handler took 105ms’.

    This happened for YouTube and Scroll events using GTM’s built-in functionality. Both were based on users reaching a certain percentage. These events incurred the error. However, the YouTube start and pause events worked fine.

    I’ve added a temporary fix for the PII variable to only work if the event = gtm.js. Can you recommend something more robust in case there’s events that include PII.

    • Brian Clifton

      Hello Stu – are you using the latest version? I updated it a week or so ago so that YouTube can be tracked. It is working for me, but no guarantees

      • Stu Bowker

        Hi Brian,
        Thanks for the update. It works great for YouTube now, but there’s still something not right as it’s not allowing Scroll Depth events to be sent 🙁

        Any ideas?

        • Brian Clifton

          Scroll depth – yuck! Why would anyone want to track that?

          Seriously, the script is something I developed for myself and share with others freely as-is. It isn’t something I support.

        • David Vallejo

          Hi Stu, without further details it’s hard to know what’s going on. Please change the line 60 from:

          }catch(e){}
          to
          }catch(e){ console.log("PII SCRIPT ERROR", e); }

          And let us know what’s the error being thrown to the console.

          Another helpful detail would be knowing which category/action/label values you are using for your events, so that we can replicate your setup

          • Stu Bowker

            Thanks David,
            Weirdly it’s working now. Not sure what’s changed, perhaps a new release by the client.

  16. Novak Mirkovic

    I can see [REDACTED EMAIL] in the real-time report, however, when I try to look back and perform content drill-down all those visits are missing. Most likely I am missing something. Looking forward to your input. Thank you!

    • Brian Clifton

      Do you literally mean the “Site Content/Content Drilldown” report in GA? If so, that is only the performance of directories. Page performance is one-up on the side menu: All Pages. That is where you will find the redacted URLs and Page Titles. Note, the technique redacts ALL of the data hit. So it redacts wherever the issue is (event, custom dimension, e-commerce fields etc.)

  17. Peter

    Hi Brian,

    When implementing the script above I encountered an issue on my site today. Maybe you can give me your thoughts. For some reason, the beacon is not submitted to Google anymore on “/” and “/de/”. As soon as I reset the customTask (remove) things work smoothly again.

    Do you have any idea, why this might happen?
    p.s.: have the script implemented on another site with same CMS where I don’t see this odd behaviour.

    Thanks in advance

    • Brian Clifton

      Hello Peter – Sounds like something very specific to your implementation. Sorry I can’t be much help, but please post back when you discover the fix.

  18. Christina

    Hi Brian,

    I really love your modified version of Simo’s script.
    One question – Since you are specifying the website domain. How would this work on a rollup account that holds data for multiple domains?

    Thanks much!
    Christina

    • Brian Clifton

      Thanks for the feedback Christina.

      Just modify the regex to suit your needs e.g.
      [^\/]{4}@(?!domain\.com|another\.se)[^\/]{4}

  19. Bryan

    Hi Brian,

    I am currently using the “Exclude URL Query Parameters” option in my GA view settings to remove PII from GA (not a filter). Will this prevent PII from getting recorded on GA’s servers?

    • Brian Clifton

      That will work, BUT that removes the important signal that something is wrong i.e. the PII will be logged elsewhere – such as web server log files and routers around the internet. Better to redact PII by default – see this post: https://brianclifton.com/blog/2017/09/07/remove-pii-from-google-analytics/, setup a GA alert for when this appears in your reports, then fix the issue at source.

      That would be GDPR compliant i.e. you have a process to monitor and fix PII issues. Your quick fix is not…

      • Daan (Maxlead)

        Dear Brian,

        I believe you are mistaken. No view settings (not a filter, not the exclusion of query parameters, not any view setting) will effectively prevent PII from being recorded in GA, since a view setting only alters the way you SEE your data. The captured data, including PII, will still be stored on a property level, which is still prohibited by the terms of GA.

        So Bryan, applying any sort of view setting to hide/filter/exclude your PII won’t prevent the recording of PII on the Google servers. The only way to prevent PII being recorded in GA is preventing that it even enters GA by either using cool scripts (in GTM) like Brian’s or Simo’s, or just fixing your website.

        Hope it helps.

      • Bryan

        Thank you for the reply, Brian! I will implement your custom JavaScript in GTM to prevent PII from getting on GAs servers.

        Is there something I can do to remove the PII that’s been getting pulled into GA, or is it too late?

      • Bryan

        Hi Brian,

        Should customTask popup in the Field Name dropdown of the “More Settings” in my Universal Analytics tag? I’m not seeing it. Does customTask need to be added to my GTM container? I’m a little confused here.

        Would you be able to point me in the right direction?

        • Brian Clifton

          It’s not part of the auto-complete lists at present, so you need to write it out.

  20. Lifan Shiu

    Hi,

    Thanks for the post. It works, but I want to extend this and use it to redact IBAN (bank account number within Europe). An IBAN number consists of:
    – Countrycode (2 letters) for example: NL
    – 2 check numbers for example 53
    – Bank code (4 characters) for example: ABNA
    – Bankaccountnumber (10 numbers): 1234567890

    The whole IBAN number would be: NL53ABNA1234567890

    I tried the following regex: /^[a-zA-Z]{2}[0-9]{2}[a-zA-Z0-9]{4}[0-9]{7}([a-zA-Z0-9]?){0,16}$/gi;

    but that didn’t work. I don’t know a lot of regex so if someone can help me out, that would be great.

    Thanks in advance.

    • Brian Clifton

      Looks like you are missing a set of brackets i.e. to wrap the preceding token before the quantifier {0,16}

      So: ^(everything in here){0,16}$/gi

      Remember this only works in GA. So if you have an issue collecting this type of data, better to fix the underlying problem as this is also likely to be in your server/router logfiles.

      HTH

  21. Jeroen

    Great article so many thanks for sharing.

    Can you give any advice on how to test this in preview before I publish?

    Is this even possible?

    Thanks in advance

    • Brian Clifton

      Go ahead and preview your tags and you will see the redacted info in your Real Time analytics reports…

      • jeroen

        Thanks for that tip.

        I have set up everything as explained but am seeing no change in the realtime results…

        Is there a delay when published before you can see the REDACTED results?

        • Brian Clifton

          It will be instant. Are you sure that the tag that fires contains the customTask field name with the js code variable?

  22. John

    When you say replace “domain\.com” with your domain, does that mean “example.com” becomes “example\.com” ?

    • Brian Clifton

      If the website you are using this on is mysite.com, replace instances of example\.com in the code with mysite\.com

  23. Bob

    Hi Brian,

    Is there a step by step video or article to walk me through this? I’m new to GTM and not overly familiar with the process of setting up triggers with variables.

    Thanks,

  24. Peter

    Great article so many thanks for sharing.
    Got a quick question on the script itself. Was wondering, if there is a way to get the full functionality scope of anonymizing PII, but avoid special characters in title tags etc. to be encoded in the reports. (characters like ü, ä, etc.)

    Any advice on that would be much appreciated
    thanks

    • Brian Clifton

      Hello Peter – are you seeing an issue with these chars? I live in Sweden and have used this method with åäö chars without problems…

      • Peter

        Hi Brian,

        Yes, I do indeed. If I roll back the customTask implementation in GTM everything goes back to “normal”. That’s why I believe its related to the PII script in your post. Results are like: https://prnt.sc/joyft8
        So the main characters that mess around are the German ü,Ü,ä,Ä,ö,Ö but also symbols like “»”. Onsite Meta Tags are decoded correctly, just in the reports they are encoded for some reason

        • Brian Clifton

          Did you use the “Update” notice below the code i.e. changing line 41?

          I am using that change (some servers handle such chars differently) and it works for me.

          • Peter

            UPDATE: Apologies, Brian! Your line 41 update actually solved it. My mistake. Sorry again and thanks for helping out

  25. nicolaos

    Thanks for this awesome post! Quick question: why not treat the EMAIL query like all the others, such as firstname and telephone?
    In other words, why not remove the whole value of the email query instead of just the 4 characters before and after the @?

    Thanks

    • Brian Clifton

      Thanks for the feedback. The simple answer is because you can i.e. there is a nice anchor (the @) to use that other fields don’t have. That way, the owner of the data is able to understand exactly what email addresses are being captured. For example, it could be legitimate mailto links to resellers i.e. not really PII. There is just no way to do this with other potential PII.

      • nicolaos

        Superb! Thanks.

        Any ideas how to implement this for Adwords too?

        Unfortunately couldn’t add a customTask for Adwords the same way I did for GA.

        Thanks in advance

        • Brian Clifton

          No, this is specific to GA at present…

          • Daan (Maxlead)

            Hi nicolaos and Brian,

            What we’ve done in our case to prevent PII being sent to AdWords is to simply create a GTM variable which compares the audited URL with the original page URL. If both of them are the same, nothing was audited (URL is OK); if they are not the same, PII was likely detected.

            Using the output from this variable either in your trigger for firing the AdWords tag(s), or by embedding this variable in an exception trigger, you can control whether the AdWords tag is fired or not; preventing PII to be sent to AdWords.

            Yes, you may miss some remarketing / conversion information, but at least PII will no longer be sent to AdWords (assuming that you’ve tackled all forms of PII possible on your webpage).
            Using Google Analytics you can find out what the problematic page is and make sure it will be fixed by the web developer. Ideally there should never be any personal information present in the page URL.

  26. Ankit

    Great post.
    Just one simple question.
    Do you know how I can use customTask for ‘AdWords Remarketing’ tags (to redact PII)? I use it very well for ‘Universal Analytics’ tags.

    Reply
    • Brian Clifton

      No, this is specific to GA at present…

      Reply
      • Ankit

        Thanks for the reply. I managed a workaround using a custom HTML tag, triggered at the ‘Page View’ event. Do you think it’s the right approach?

        Reply
  27. Marco

    Hi Brian,

    Thank you very much for posting this useful information.

    I am quite interested in implementing this mechanism on my website, which is coded in PHP rather than JavaScript. Do you know if this is a limitation for implementing GTM (I think the code snippet is only available as .js?) and consequently your custom JavaScript variable?

    Reply
    • Brian Clifton

      Hello Marco – GTM is a client-side JavaScript only library. There is no other way to implement it…

      Reply
  28. Travis B

    Any way to go about doing this in GA without using Tag Manager?

    Reply
    • Brian Clifton

      You really want to do this at the point of data collection i.e. via GTM (or other tag manager solution) and not once the data is already in your GA account. Essentially, once you have collected the PII data you have already broken data privacy laws – regardless of where it is then stored/processed.

      However, note this technique is a monitoring system for you – using GA to redact and then flag up PII issues with your website. Ultimately these need to be flagged to the IT/Web Dev team to sort out the underlying issue, as even if the data does not get into GA it is very likely to be in web server log files, router log files, firewall log files etc.

      Reply
      • Travis B

        Thanks Brian! I’m actually having to use analytics.js just for the sake of the way our e-commerce tracking is set up. I found out I can load analytics.js and then the ecommerce.js plugin without sending a page view, and then use GTM for page views and for setting up this PII flagging. Great post! Thanks for the response.

        Reply
  29. Matt

    Thanks for sharing! I really like the idea of capturing more types of PII. However, it seems like this code snippet only works for the EMAIL and SELF-EMAIL patterns. For the other types of PII, it seems like the regex pattern isn’t getting applied to the right piece.

    It looks like the line
    ```
    hitPayload = sendModel.get('hitPayload').split('&')
    ```

    breaks the URL into an array of strings, splitting it at every `&` (and dropping the `&` itself).

    So, a starting URL of:
    ```
    /test?tel=+44012345678&email=brian@me.com&other=bclifton@DOMAIN.com&firstName=brian&password=hello
    ```

    Would look like this as a JavaScript array:
    ```
    [
    "/test?tel=+44012345678",
    "email=brian@me.com",
    "other=bclifton@DOMAIN.com",
    "firstName=brian",
    "password=hello"
    ]
    ```

    You then cycle through this array with a for loop:
    ```
    for (i = 0; i < hitPayload.length; i++) {

    }
    ```

    Inside the for loop, you break each string into another array of strings by splitting on the `=`:
    ```
    parts = hitPayload[i].split('=');
    ```

    So a piece of the larger array like:
    ```
    "firstName=brian"
    ```

    Becomes:
    ```
    ["firstName", "brian"]
    ```

    You then decode the second item in the array and assign it to `val`.
    ```
    val = decodeURIComponent(unescape(parts[1]));
    ```

    Here’s what looks like a problem to me: only "val" gets the regex pattern applied to it:
    ```
    piiRegex.forEach(function (pii) {
    val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
    });
    ```

    The code seems to cycle through the PII regex patterns and apply both the pattern and the replacement only to the values of any query parameters, not to the parameter names. The anchor needs to search for terms like `firstName`, but I think it only ever sees values like "brian".

    You then encode the value again, join each part back by the =, and join everything back up into one url with the &.
    ```
    parts[1] = encodeURIComponent(val);
    hitPayload[i] = parts.join('=');
    } // end for loop
    sendModel.set('hitPayload', hitPayload.join('&'), true);
    ```

    Am I missing something? It seems like the regex pattern is only getting applied to the second part of `parts`, i.e. the value in the query parameter (e.g. "brian"), but never the full query parameter (e.g. "firstName=brian").

    Reply
    • Brian Clifton

      @Matt – I think you are mixing up what is actually being checked here. The routine does not go through each name/value pair of the URL. Rather, it goes through the name/value pairs of the measurement protocol hit i.e. the hitPayload.

      For MP parameters, see https://developers.google.com/analytics/devguides/collection/protocol/v1/parameters

      So for a URL containing suspected data, the name/value check will be on the dl parameter i.e. the URL as a whole, and that’s where the regex check takes place. Then the for loop moves on to the next MP parameter.
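
      To make this concrete, here is a minimal sketch that runs the same split/decode/replace/encode steps over a simulated payload string. It is a simplification under stated assumptions: the single FIRSTNAME pattern is a hypothetical stand-in for the full list, the payload is passed in as an argument rather than read from `sendModel`, and the `unescape` call is left out.

      ```javascript
      // Stand-in for the post's pattern list: one hypothetical entry for this sketch.
      var piiRegex = [
        { name: 'FIRSTNAME', regex: /((?:first|given)name=)[^&]+/gi, group: '$1' }
      ];

      // Apply the patterns to each Measurement Protocol parameter value in turn.
      function redactPayload(payload) {
        var hitPayload = payload.split('&');
        for (var i = 0; i < hitPayload.length; i++) {
          var parts = hitPayload[i].split('=');
          if (parts.length < 2) { continue; }
          // For the dl parameter, val is the complete page URL, query string included.
          var val = decodeURIComponent(parts[1]);
          piiRegex.forEach(function (pii) {
            val = val.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
          });
          parts[1] = encodeURIComponent(val);
          hitPayload[i] = parts.join('=');
        }
        return hitPayload.join('&');
      }

      // A simulated payload: v, t, tid, cid and dl are standard MP parameter names,
      // but the values here are made up.
      var sample = 'v=1&t=pageview&tid=UA-XXXXX-Y&cid=555&dl=' +
        encodeURIComponent('https://example.com/test?firstName=brian&lang=en');
      console.log(redactPayload(sample));
      ```

      Because the page URL is itself URL-encoded inside the dl value, its own `&` and `=` characters do not split the payload. After decoding, the regex sees `firstName=brian` in full, anchor included.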

      HTH

      Reply
      • Matt

        Thanks. I was indeed assuming the “hitPayload” was a URL.

        Reply
        • Brian Clifton

          No problem. The benefit of this method is that ALL data types are checked – not just pageview URLs i.e. event data, e-commerce data, custom dimensions, campaign parameters etc.

          Reply
  30. Simms

    Great post but one small question.

    Is it possible to make this apply only to GA and not to our own software system? Is this code a blanket block for all tags, or could some tags be whitelisted?

    Reply
    • Brian Clifton

      Hello @simms – this method uses a Custom JS variable that is applied via Google Analytics’ new customTask feature. You apply it to whichever GA tags you need, but it will not work with non-GA tags unless they also support customTask. Note: customTask is a very new feature (Aug 2017) and I am not aware of other tags that can use it.

      That said, if you are collecting PII in plain text e.g. within URLs (regardless of where you are sending it), then you have a privacy issue as it will be logged by default on your webserver and on every router the URL passes through…

      Reply
  31. Stu Bowker

    Thanks Brian. Might be worth adding ‘username’ too.

    Reply
    • Brian Clifton

      Yes exactly. I would customise the regex for your environment. Accountname, uname, user_name, customer etc., are all English possibilities. And if you work in multiple markets you will need to consider others as well e.g. kund(er) for the Swedish market. Essentially, a generic regex for all users doesn’t really work, it needs to be tailored…
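
      As a rough sketch of that tailoring, the pattern for username-style keys could be built from a list you maintain per site or market. The key list, `buildKeyPattern` and `redact` are hypothetical names for illustration; the `name`/`regex`/`group` shape is assumed from the snippet quoted in the comments.

      ```javascript
      // Hypothetical key list built from the suggestions above; extend it per market.
      // Keys are plain words here, so no regex escaping is needed.
      var userKeys = ['username', 'uname', 'user_name', 'accountname', 'customer', 'kunder', 'kund'];

      // Build one pattern entry that matches "<key>=<value>" for any listed key.
      function buildKeyPattern(name, keys) {
        return {
          name: name,
          regex: new RegExp('((?:' + keys.join('|') + ')=)[^&]+', 'gi'),
          group: '$1'
        };
      }

      // Keep the key, replace only the value.
      function redact(value, pii) {
        return value.replace(pii.regex, pii.group + '[REDACTED ' + pii.name + ']');
      }

      var usernamePattern = buildKeyPattern('USERNAME', userKeys);
      console.log(redact('/account?user_name=bclifton&lang=sv', usernamePattern));
      ```

      Keeping the keys in an array makes it easier to review and extend the list for each market without editing the regex by hand.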

      Reply
  32. Jon Hibbitt

    Thanks! Awesome post and nice enhancements on Simo’s original. This will be going into the toolbox for sure.

    Reply
  33. Yehoshua Coren

    Hi Brian,

    I’m surprised that you included [redacted self-email] as an issue with regards to GA TOS. I would argue that the Terms of Service are concerned with personally identifying visitors to your site, and are not concerned with the personal details of the users of the website itself.

    As such, I find the customTask to be way too broad with regards to how it redacts information. My take is that doing things such as tracking who the user is trying to contact is not problematic. That applies to phone numbers tapped and mailto: links clicked. Additionally, while I would agree that a user’s address is out of bounds for data collection in GA, I don’t see how a postal code is a problem. Please argue the other side with me in the comments; I’m open to hearing it.

    Hope you’re doing well.

    Yehoshua

    Reply
    • Brian Clifton

      Hej @Yehoshua – although capturing your own org’s email addresses in GA reports is not strictly PII, it would likely be picked up by Google’s compliance robots. And who knows what that would result in – possibly data being deleted, your account suspended etc. Although you would be correct to argue the fine nuance of the reality, my suggestion is to avoid it in the first place i.e. good luck talking to a human compliance/legal officer at Google to make your case!

      In terms of zip codes, these can get very specific. For example, in the UK they can be limited to a set of 5 houses. Also, I found this article specific to Canada:

      “…it was found that 87.9% of the postal code locations were within 200 meters of the true address location (straight line distances) and 96.5% were within 500 meters of the address location (straight line distances).”

      https://ij-healthgeographics.biomedcentral.com/articles/10.1186/1476-072X-3-5

      Reply
      • Stu Bowker

        I believe that capturing the first part of a UK postcode is OK because it’s at a city level, not street. For example, if the postcode equals “BA1 1AA”, then collecting just “BA1” in GA is fine.

        Reply
        • Brian Clifton

          Most likely true for the UK (always check such things with your legal/compliance team) and the regex can be customised accordingly. I would suggest if the post code is present, then the street address should also be checked and redacted…
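
          One possible customisation along these lines keeps the first (outward) part of a UK postcode and redacts the rest. The pattern below is a simplified assumption, not the full official postcode format, and `redactInwardCode` is a hypothetical helper name; check any such rule with your legal/compliance team.

          ```javascript
          // Simplified UK postcode pattern (an assumption, not the full official format):
          // outward code such as "BA1" or "SW1A", optional space, then inward code "1AA".
          var ukPostcode = /\b([A-Z]{1,2}\d[A-Z\d]?)\s*\d[A-Z]{2}\b/gi;

          // Keep the outward (area/district) part and redact the inward part.
          function redactInwardCode(value) {
            return value.replace(ukPostcode, '$1 [REDACTED POSTCODE]');
          }

          console.log(redactInwardCode('/delivery?postcode=BA1 1AA'));
          ```

          The sketch expects the decoded value, with a literal space (or no separator) between the two parts; a `+` separator is not handled.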

          Reply
      • Milos

        Precisely. As one GA expert once told me, don’t f*k with it 🙂
        If there is a remote possibility that GA will interpret the data as PII, don’t import it. Great post.

        Reply
