Data spam has always been present within web analytics reports. However, over the past year or so, referrer spam has become a real PITA. By referrer spam, I mean spammers and scammers polluting your Google Analytics reports with their junk links in the hope you will say: “Oh, what is that? Let’s visit the referral site that is sending us traffic and see who they are”. Of course, the purpose is to drive traffic to their own site for ad impressions, or to push malware down your throat. Just like email spam, it’s annoying and a time-waster, but unlike email spam it is not so in your face. That means referral spam lies below the radar, buried in your reports. The result is that its data-distorting effects can often go unnoticed.
What are the distorting effects of data spam?
The obvious one is inflated traffic numbers: visitors, sessions and pageviews. However, the impact is much more substantial than simple traffic numbers. Spam visits are low-engagement, non-converting, high bounce-rate traffic. They skew all your “success metrics” downwards, because whenever you look at a performance ratio or percentage (conversion rate, for example), the denominator contains junk.
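To make the denominator effect concrete, here is a minimal sketch of the arithmetic. The session and conversion counts are hypothetical numbers picked purely for illustration; they are not taken from any real account.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
real_sessions = 1000   # sessions from genuine visitors
spam_sessions = 400    # sessions injected by spammers
conversions = 30       # spam sessions never convert


def conversion_rate(conversions: int, sessions: int) -> float:
    """Conversion rate as a percentage of sessions."""
    return 100.0 * conversions / sessions


# Rate you would see with clean data.
true_rate = conversion_rate(conversions, real_sessions)

# Rate you actually see: same conversions, junk added to the denominator.
reported_rate = conversion_rate(conversions, real_sessions + spam_sessions)

print(f"true rate:     {true_rate:.2f}%")
print(f"reported rate: {reported_rate:.2f}%")
```

The number of conversions has not changed, yet the reported rate drops simply because the spam sessions are counted as visits that failed to convert.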
The problem is significant because the main impact is on “referral traffic”, i.e. those highly qualified visitors (leads) you receive from all the hard work you do building your partnerships and affiliates, and placing links within your social media discussions. Referrals are therefore valuable traffic. When you look at your referral visitors only, spam can account for as much as 50% of them. That renders assessing your referral performance all but impossible…
Apply two filters to fix…
- Add a Hostname Filter – only allow your own domain name to send data to Google Analytics
- Add a Referral Source Filter – remove spam referrers
Apply these two View filters to your data and you can almost eliminate your referral spam. Note you need admin rights for your GA account to do this:
1. The Hostname Filter
This straightforward filter tells GA to only include data in your account if it has come from your website, i.e. any third-party site is excluded. Replace mydomain\.com with yours. Note you must escape the special character “.” with a backslash.
If you have multiple domains you are tracking use a regex OR statement e.g. for two domains:
The pipe character “|” means OR.
If you have numerous country or market specific domains e.g. mydomain.com, mydomain.co.uk, mydomain.cn, mydomain.de and so forth – just use the common string, mydomain in the regex. In other words, you do not need to list them all out fully as they contain the same matching text.
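If you want to check a hostname pattern before putting it into a filter, a quick test outside GA can help. The sketch below uses Python's `re.search` as a stand-in for GA's matching, on the assumption that a filter pattern matches anywhere in the field unless you anchor it. The second domain `myotherdomain\.com` and the test hostnames are hypothetical examples, not patterns from this post.

```python
import re

# Hypothetical hostnames to test the patterns against.
hostnames = [
    "www.mydomain.com",
    "shop.mydomain.co.uk",
    "www.myotherdomain.com",
    "mydomainXcom.example",
]

single = re.compile(r"mydomain\.com")                       # escaped dot
unescaped = re.compile(r"mydomain.com")                     # "." matches any character
either = re.compile(r"mydomain\.com|myotherdomain\.com")    # "|" means OR
common = re.compile(r"mydomain")                            # shared string only


def matched(pattern, names):
    """Return the names in which the pattern is found anywhere."""
    return [name for name in names if pattern.search(name)]


for label, pattern in [("single", single), ("unescaped", unescaped),
                       ("either", either), ("common", common)]:
    print(label, matched(pattern, hostnames))
```

Note that the unescaped version also accepts `mydomainXcom.example`, which is why the backslash matters. The common-string pattern is convenient, but it accepts any hostname containing that text, so it is worth checking what it lets through.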
Why is hostname ‘googleusercontent’ present?
This is the host name used by Google when people use Google Translate on your pages, or your content is delivered by Google’s cache. The actual hostnames are: webcache.googleusercontent.com and translate.googleusercontent.com. Using ‘googleusercontent’ within your filter will capture both and allow such content to be displayed in your reports. [ Thanks to Thomas Geiger for this extra tip ]
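As a rough illustration of what the hostname include filter does to incoming hits, here is a small simulation. The pattern `mydomain\.com|googleusercontent` is my assumption of what such a filter might look like, and the hit records (including `spam-host.example`) are invented for the example; GA itself does not expose hits in this form.

```python
import re

# Assumed include pattern: your own domain plus Google's translate/cache hosts.
HOSTNAME_INCLUDE = re.compile(r"mydomain\.com|googleusercontent", re.IGNORECASE)

# Hypothetical hits, each carrying the hostname that served the page.
hits = [
    {"hostname": "www.mydomain.com", "page": "/pricing"},
    {"hostname": "translate.googleusercontent.com", "page": "/pricing"},
    {"hostname": "webcache.googleusercontent.com", "page": "/"},
    {"hostname": "spam-host.example", "page": "/"},
    {"hostname": "(not set)", "page": "/"},
]


def apply_include_filter(hits, pattern):
    """Keep only hits whose hostname matches the include pattern."""
    return [hit for hit in hits if pattern.search(hit["hostname"])]


kept = apply_include_filter(hits, HOSTNAME_INCLUDE)
for hit in kept:
    print(hit["hostname"], hit["page"])
```

The two googleusercontent hosts survive alongside your own domain, while hits claiming any other hostname are dropped.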
2. The Referral Source Filter
This filter removes most polluting referrers. The full filter pattern is too long to show in the screenshot, so copy and paste my regex below into your filter (on one line). It’s that easy!
The referral source filter works pretty well for a range of websites, but it is not a definitive list, as organisations will have different spammers targeting them. Therefore, apply this filter as your first step and monitor the effect, then adjust the regex as you see fit.
As a tip, I recommend you collect those spammers into a separate report set (View), i.e. set up an include filter with the same regex. That way you get to see exactly what referral sources are being filtered out, and you can spot any false positives.
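The exclude filter and its mirror-image include View can be pictured as one regex splitting your referrers into two piles. The sketch below shows that idea only; the spam pattern and every referrer name in it are hypothetical placeholders, not my actual filter regex.

```python
import re

# Placeholder spam pattern; substitute the regex you actually use.
SPAM_REFERRERS = re.compile(
    r"spam-one\.example|spam-two\.example|free-traffic", re.IGNORECASE
)

# Hypothetical referral sources.
referrals = [
    "partner-site.example",
    "spam-one.example",
    "news.spam-two.example",
    "free-traffic-tips.example",
    "blog.partner-site.example",
]


def split_by_filter(sources, pattern):
    """Return (clean, caught): what an exclude filter keeps and what it removes."""
    clean, caught = [], []
    for source in sources:
        (caught if pattern.search(source) else clean).append(source)
    return clean, caught


clean, caught = split_by_filter(referrals, SPAM_REFERRERS)
print("main View keeps:  ", clean)
print("spam View collects:", caught)
```

The `caught` list is the one to review from time to time: anything in there that looks like a genuine partner is a false positive, and the regex needs tightening.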
What about historical data?
Filters are a must-have configuration for removing spam going forward. Therefore ensure adding these filters is a part of your setup ABCs. However, if you have historically collected spam in your Google Analytics reports you will want to remove this as well. That cannot be permanently done using a filter – instead, you apply a segment when viewing your reports (here’s a post to help you understand the difference between a filter and a segment).
Apply the regex from the filters above to a segment to remove spam from your historical reports.
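Conceptually, a segment applies the same regex at viewing time and leaves the collected data alone. Here is a minimal sketch of that difference, using made-up historical rows and a placeholder spam pattern (both are assumptions for illustration only).

```python
import re

# Placeholder spam pattern; substitute the regex you actually use.
SPAM_REFERRERS = re.compile(r"spam-one\.example|spam-two\.example", re.IGNORECASE)

# Hypothetical stored history: (date, referral source, sessions).
# A tuple is used to stress that the collected data is not modified.
HISTORY = (
    ("2015-01-10", "partner-site.example", 120),
    ("2015-01-10", "spam-one.example", 90),
    ("2015-01-11", "blog.partner-site.example", 80),
    ("2015-01-11", "news.spam-two.example", 110),
)


def exclude_segment(rows, pattern):
    """Return the rows whose source does not match; the input is left untouched."""
    return [row for row in rows if not pattern.search(row[1])]


def total_sessions(rows):
    """Sum the sessions column."""
    return sum(row[2] for row in rows)


segmented = exclude_segment(HISTORY, SPAM_REFERRERS)
spam_share = 100.0 * (total_sessions(HISTORY) - total_sessions(segmented)) / total_sessions(HISTORY)

print("referral sessions, raw:      ", total_sessions(HISTORY))
print("referral sessions, segmented:", total_sessions(segmented))
print(f"spam share of referrals:      {spam_share:.0f}%")
```

The stored rows are still all there afterwards, which is the point: you can remove or adjust a segment at any time, whereas a filter changes what gets stored from then on.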
Tips
- If you are new to regex, try my Jumpstart Regular Expression Tutorial for GA users.
- Applying filters manipulates your data in a way that cannot be undone for historical data. Therefore always test View filters first – use a separate report set (View) for this.
- As well as excluding spam referrers for obvious reasons, collect them into a separate report set (View) by reversing the filter logic. That way you can see exactly what you are excluding – in case there are any false positives!
- View Filters are limited to 255 characters in length. Therefore be creative/smart with your regex – to see the range of sites my regex captures create a separate View of your data with my filter in reverse. That is, an include filter. You can also use cascading filters i.e. combining multiple filters to cater for long regex matches.
- Do NOT use the Referral Exclusion feature of GA to remove spam – see my/David’s comments
- (Not a tip, but..) I still cannot get used to the renaming of Profiles to Views…
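On the 255-character limit and cascading filters mentioned in the tips above: one way to handle a long list of spam domains is to pack the alternatives into as few patterns as possible, each within the limit, and create one exclude filter per pattern. The helper below is a hypothetical sketch of that packing step; `pack_patterns` and the generated domain names are my own illustration, not a GA feature.

```python
def pack_patterns(alternatives, max_len=255):
    """Greedily join regex alternatives with "|" into patterns of at most max_len chars."""
    patterns = []
    current = ""
    for alt in alternatives:
        if len(alt) > max_len:
            raise ValueError(f"single alternative exceeds {max_len} characters: {alt!r}")
        candidate = alt if not current else current + "|" + alt
        if len(candidate) <= max_len:
            current = candidate
        else:
            # Current pattern is full; start a new one with this alternative.
            patterns.append(current)
            current = alt
    if current:
        patterns.append(current)
    return patterns


# Hypothetical list of 40 escaped spam domains.
alternatives = [f"spam-site-{i:03d}\\.example" for i in range(40)]

patterns = pack_patterns(alternatives)
for number, pattern in enumerate(patterns, start=1):
    print(f"filter {number}: {len(pattern)} characters")
```

This works for exclude filters because each one removes its own matches independently. It would not work the same way for include filters applied in sequence, since a hit would then have to match every one of them.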