Noise or Music? - The Insights Blog

How to Remove Referral Spam from Google Analytics

Categories: Google Analytics and GTM, Implementation Fundamentals / Comments: 25


Data spam has always been present within web analytics reports. However, over the past year or so it has become a real PITA. By data spam, I am referring to spammers and scammers polluting your Google Analytics reports with their junk links in the hope you will say – “Oh, what is that? Let’s visit the referral site that is sending us traffic and see who they are”. Of course the purpose is to drive traffic to their own site for ad impressions, or to push malware down your throat. Just like email spam, it’s annoying and a time-waster – but unlike email spam it is not so in your face. That means referral spam lies below the radar, buried in your reports. The result is that its data-distorting effects can often go unnoticed.

What are the distorting effects of data spam?

The obvious one is inflated traffic numbers – visitors, sessions and pageviews. However, the impact is much more substantial than simple traffic numbers. For example, spam visits are just that – low engagement, non-converting, high bounce-rate traffic. They skew all “success metrics” downwards, because the denominator contains junk whenever you look at a performance ratio or percentage.
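As a rough illustration of that denominator effect, here is a minimal Python sketch. The session and conversion counts are made-up numbers chosen for easy arithmetic, not figures from any real account.

```python
def conversion_rate(conversions: int, sessions: int) -> float:
    """Return conversions per session, expressed as a percentage."""
    return 100.0 * conversions / sessions

# Hypothetical figures for one reporting period.
real_sessions = 1000   # genuine visitor sessions
spam_sessions = 250    # spam sessions, which never convert
conversions = 30       # every conversion comes from a genuine visitor

true_rate = conversion_rate(conversions, real_sessions)
reported_rate = conversion_rate(conversions, real_sessions + spam_sessions)

print(f"True conversion rate:     {true_rate:.2f}%")
print(f"Reported conversion rate: {reported_rate:.2f}%")
```

The number of conversions is identical in both cases. Only the denominator changes, and that alone drags the reported rate down.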

The problem is significant because the main impact is on “referral traffic” i.e. those highly qualified visitors (leads) you receive from all the hard work you do building your partnerships and affiliates and placing links within your social media discussions. Therefore, referrals are valuable traffic. When you look at your referral visitors only, spam can account for as much as 50% of them. That renders assessing your referral performance all but impossible…

Apply two filters to fix…

  1. Add a Hostname Filter – only allow your own domain name to send data to Google Analytics
  2. Add a Referral Source Filter – remove spam referrers

Apply these two View filters to your data and you can almost eliminate your referral spam. Note you need admin rights for your GA account to do this:

1. The Hostname Filter

This straightforward filter tells GA to only include data in your account if it has come from your website i.e. any third-party site is excluded. Replace mydomain\.com with yours. Note you must escape the special character “.” with a backslash.

[Screenshot: hostname filter]

If you have multiple domains you are tracking use a regex OR statement e.g. for two domains:
mydomain\.com|myotherdomain\.com
The pipe character “|” means OR.

If you have numerous country or market specific domains e.g. mydomain.com, mydomain.co.uk, mydomain.cn, mydomain.de and so forth – just use the common string, mydomain, in the regex. In other words, you do not need to list them all out in full, as they contain the same matching text.

Why is hostname ‘googleusercontent’ present?

This is the host name used by Google when people use Google Translate on your pages, or your content is delivered by Google’s cache. The actual hostnames are: webcache.googleusercontent.com and translate.googleusercontent.com. Using ‘googleusercontent’ within your filter will capture both and allow such content to be displayed in your reports. [ Thanks to Thomas Geiger for this extra tip ]
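If you want to sanity-check a hostname pattern before saving the filter, a small Python sketch like the one below can help. It assumes the filter pattern is mydomain\.com|googleusercontent and that GA applies it as an unanchored, case-insensitive match, which Python's re.search only approximates. The sample hostnames are hypothetical.

```python
import re

# Assumed include pattern: your own domain plus Google's cache/translate hosts.
HOSTNAME_PATTERN = re.compile(r"mydomain\.com|googleusercontent", re.IGNORECASE)

def is_valid_hostname(hostname: str) -> bool:
    """Return True if an include filter with this pattern would keep the hit."""
    return HOSTNAME_PATTERN.search(hostname) is not None

# Hypothetical values as they might appear in the Hostname report.
samples = [
    "www.mydomain.com",
    "translate.googleusercontent.com",
    "webcache.googleusercontent.com",
    "spammy-site.example",
    "(not set)",
]

for host in samples:
    verdict = "keep" if is_valid_hostname(host) else "drop"
    print(f"{host:35} {verdict}")
```

Because the match is unanchored, any hostname that merely contains your domain string is also kept. Whether to tighten the pattern is a judgment call for your own setup.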

2. The Referral Source Filter

[Screenshot: spam referral filter]

This filter removes most polluting referrers. The full filter pattern is too long to show in the screenshot, so copy and paste my regex below into your filter (on one line). It’s that easy!

offer|free\-|share\-|buy|cheap|semalt|googlsucks|hulfington|buttons|
darodar|money|blackhat|backlink|webrank|seo|phd|
crawler|anonymous|\d{3}.*forum|porn|webmaster|flipboard|fl\.ru|
mbca|ahrefs|game|\.io|^sex|^video

The referral source filter works pretty well for a range of websites, but is not a definitive list – organisations will have different spammers targeting them. Therefore apply this filter as your first step and monitor the effect – then adjust the regex as you see fit.

As a tip, I recommend you collect those spammers into a separate report set (View) i.e. set up an include filter with the same regex. That way you get to see exactly what referral sources are being filtered out and you can spot any false positives.
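The same check can be roughed out offline before you touch any View. The sketch below is only an approximation: it runs the referral regex above through Python's re module (assuming an unanchored, case-insensitive match) and splits a list of referrers into those the filter would exclude and those it would keep. The sample referrers are illustrative, and github.io is included purely as an example of a legitimate source caught by the \.io term.

```python
import re

# The referral pattern from this post, joined onto one line.
SPAM_PATTERN = re.compile(
    r"offer|free\-|share\-|buy|cheap|semalt|googlsucks|hulfington|buttons|"
    r"darodar|money|blackhat|backlink|webrank|seo|phd|"
    r"crawler|anonymous|\d{3}.*forum|porn|webmaster|flipboard|fl\.ru|"
    r"mbca|ahrefs|game|\.io|^sex|^video",
    re.IGNORECASE,
)

def partition(referrers):
    """Split referrers into (excluded, kept) lists according to SPAM_PATTERN."""
    excluded, kept = [], []
    for ref in referrers:
        (excluded if SPAM_PATTERN.search(ref) else kept).append(ref)
    return excluded, kept

# Illustrative referral sources, e.g. exported from the Referrals report.
sample = [
    "semalt.semalt.com",
    "buttons-for-website.com",
    "darodar.com",
    "github.io",
    "google.com",
    "example.org",
]

excluded, kept = partition(sample)
print("Excluded:", excluded)
print("Kept:    ", kept)
```

Reading through the excluded list is the quickest way to spot a false positive and decide which term in the regex needs adjusting.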

What about historical data?

Filters are a must-have configuration for removing spam going forward. Therefore ensure adding these filters is a part of your setup ABCs. However, if you have historically collected spam in your Google Analytics reports you will want to remove this as well. That cannot be permanently done using a filter – instead, you apply a segment when viewing your reports (here’s a post to help you understand the difference between a filter and a segment).

Apply the regex from the filters above to a segment to remove spam as follows:

[Screenshot: spam referrer segment]

Tips:

  • If you are new to regex, try my Jumpstart Regular Expression Tutorial for GA users.
  • Applying filters manipulates your data in a way that cannot be undone for historical data. Therefore always test View filters first – use a separate report set (View) for this.
  • As well as excluding spam referrers for obvious reasons, collect them into a separate report set (View) by reversing the filter logic. That way you can see exactly what you are excluding – in case there are any false positives!
  • View Filters are limited to 255 characters in length. Therefore be creative/smart with your regex – to see the range of sites my regex captures create a separate View of your data with my filter in reverse. That is, an include filter. You can also use cascading filters i.e. combining multiple filters to cater for long regex matches.
  • Do NOT use the Referral Exclusion feature of GA to remove spam – see my/David’s comments
  • (Not a tip, but..) I still cannot get used to the renaming of Profiles to Views…
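
For the 255 character limit, one possible way to prepare cascading exclude filters is to split a long alternation into several shorter patterns. The Python sketch below is a simple greedy splitter, not a GA feature. It assumes the pattern is a flat list of alternatives separated by “|”, with no “|” inside groups or character classes. The long pattern is built from made-up domains.

```python
from typing import List

MAX_FILTER_LENGTH = 255  # View filter pattern limit mentioned in the tips above

def split_pattern(pattern: str, limit: int = MAX_FILTER_LENGTH) -> List[str]:
    """Greedily pack the alternatives of a flat regex into patterns of at most `limit` characters."""
    chunks: List[str] = []
    current = ""
    for alt in pattern.split("|"):
        if len(alt) > limit:
            raise ValueError(f"alternative too long for one filter: {alt!r}")
        candidate = alt if not current else current + "|" + alt
        if len(candidate) <= limit:
            current = candidate
        else:
            # The current chunk is full: store it and start a new one.
            chunks.append(current)
            current = alt
    if current:
        chunks.append(current)
    return chunks

# Hypothetical long exclude list: 60 made-up spam domains.
long_pattern = "|".join(f"spam{i}\\.example" for i in range(60))
filters = split_pattern(long_pattern)

for number, pattern in enumerate(filters, start=1):
    print(f"Filter {number}: {len(pattern)} characters")
```

Splitting like this suits exclude filters, where each filter removes its own matches. As far as I understand include filters, a hit must pass every one of them, so chaining them would not behave like a single long OR.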

Comments (most recent first)

  1. Matt says:

    GA’s filter won’t accept Russian (cyrillic) characters. How do you deal with these when creating a filter – is there a way to escape them? I tried using a backslash and it did not work.

    Here is one of the spammy referrers: lifehacĸer\.com (notice the “k” is a cyrillic “ĸ”).

    • @Matt – Thanks for the nudge. I am about to look at this post again in the near future. Particularly with reference to spam coming in via the language settings…

      A quick solution to your problem could be to use:
      lifehac(.+)er\.com

    • Georgi says:

      Hi Matt,

      The filters in GA do accept Cyrillic characters, but the “k” in this spam referrer is not a Cyrillic one, instead it’s a “small capital k”. You can definitely include it in the filter, if you wish to do so.

      However, the best approach would be to create an exclude filter on the Language dimension instead, since this same type of spam comes from many referrers, even legitimate ones such as reddit.com and the language dimension is the one that is common to them all. You can read more on my detailed post here: http://blog.analytics-toolkit.com/2016/language-spam-latest-google-analytics-spam/

  2. Yes, there are many reasons why content ends up as junk and spam. I am trying different strategies to prevent this problem. Can you tell me where I can get these filters? Which is the best company that offers these services?
    Thanks,
    Jennifer

  3. Hi,
    Great post!
    I would like to mention a tool we just released: http://www.saystoptospam.org/
    This is an easy way to share the spam referrers you blocked in our reports.
    We use them to publish, every week, an updated Google Analytics segment to be applied on your reports.
    It’s totally free.
    Could you spread the word on one of your social accounts or on your blog?
    Thanks

  4. Dennis says:

    We have collected around 425 domains now and we need to create almost 50 filters. Just crazy!

    To automate this task we created “Google Analytics Referrer Spam Killer” at https://www.adwordsrobot.com/en/tools/ga-referrer-spam-killer

    We found another method around the web; checking screen resolution to be not “(not set)”. That might work too.

    If you have any feedback on the tool, it is very welcome.

    Dennis

  5. Geo says:

    Brian, that’s all fine, but what do you do when you need to apply the filters to hundreds of web properties? And keep them updated afterwards, no less… Google, it seems, is nowhere near providing us a solution to this, so I developed my own: https://www.analytics-toolkit.com/auto-spam-filters/ . A fully-automated referrer spam blocker / protection tool. You might want to check it out if you manage more than a couple of web properties in Google Analytics.

  6. Thanks Brian, I am getting busy adding filters to my sites, wish I had done it a year ago… I am more annoyed by them creeping into my events and campaigns than my referrals. Anywhoo, I know this post is all about removing spam, but I wanted to share some research I did on the size of the problem… I looked at 77 websites and found they had (in June) 14.4 spam referrers and totaling 217.7 visits. Check out the research, and if you have 5 minutes, follow the directions to add your own data to my benchmark. Check it out at

  7. Amit Ramani says:

    Thank you much for this useful blog post. I just implemented both of these filters. I am delighted to find such clearly explained Google Analytics tips!

  8. Chris says:

    Your tip to collect the “spammers” into a separate report set and check them should certainly be underlined, Brian. I’m not sure what poor old pistonheads.com has done to offend you.

  9. John says:

    One thing I’ve done for a client site which was hit hard by this kind of referral spam was that, when I filtered it out, I set up a new view which only includes these visits – that way I can make periodic checks to ensure that I’m not accidentally filtering out potentially ‘real’ visits.

  10. Doug Hall says:

    Solid article Brian and all reasonable comments too although my preference is to keep the ‘fix’ in the GA config using the filters as described rather than using htaccess mods. The fix is specific to GA so my preference is not to pollute other systems – keep the fix decoupled from other systems.

    Whilst the suggested filters are good practice and are representative of a solid GA install (these techniques are just good practice) they are not going to stop ALL referral spam. We’re still vulnerable to Measurement Protocol injection.

    Is this a bad thing? Does this reflect badly on GA as a vulnerability?

    No to both questions. GA, like all similar low latency measurement systems that are built on pixel requests, javascript, cookies and the HTTP protocol, is inevitably open to such abuse. This is not a fragility or flaw specific to GA. Indeed, given the features of measurement systems such as GA, this behaviour is a reality and needs to be dealt with as part of the effort to maximise data quality.

    We consider event tracking, virtual pageviews, transaction tracking carefully as part of ‘data quality’. The measurements we take are not trivial decisions and such care is required in all aspects of data capture – both intentional and accidental.

    In striving for high quality data we consider how we use the data, what it’s for, who will use it and how to deliver it. We make choices to normalise and sanitise data to make it actionable and fit for purpose. If we have data pollution, we act to mitigate. Referral spam falls under the banner of pollution. We deal with it.

    The effort required to deal with the pollution isn’t massive. The urgency you apply to the fix will depend on the impact on your data. If referral spam is high enough to impact your data to an appreciable degree, it’s quite possible spam is less of an issue than the data volume you’re collecting anyway. By that, I mean If your data volume is small enough to be impacted by spam, was the data actionable in the first place? Was it rich enough to base reliable business decisions on? Were you making calls on insignificant data?

    Now, I’d be ASTONISHED if the Google Analytics team weren’t aware of the issue. Indeed, the addition of the ‘Exclude all hits from known bots and spiders’ functionality is evidence that Google DO take ‘automatic’ data quality seriously enough to act on it.

    Take the advice Brian gives in this article. Act on it. Use your data wisely and appropriately. Be aware of changes in GA that can help you. I’ve no doubt Brian will add some notes to this post when Google add further support.

  11. Sandra says:

    what is your opinion on using the referral exclusion list for this?

    • Using the referral exclusion list won’t work. Using it will hide the problem instead of fixing it, since the traffic will end up being shown as Direct Traffic.

      The referral exclusion list is for modifying how GA attributes the visits from those domains, not for excluding them. (It stops those domains from being able to set a source/medium, and instead treats those visits as Direct Traffic.)

    • @David is spot on. The Referral Exclusion list is used for something different i.e. when you use a 3rd-party payment gateway to process customer credit cards. When used correctly, if the visitor returns to your site within 30 mins the 3rd-party gateway will not overwrite the original visitor referrer.

      It’s a confusingly named feature…! But DO NOT use it to remove spam referrals.

      As David points out, if you do, the spam referrals will disappear from the referral report, but will be added to your “Direct” attribution – creating another problem…

  12. Sandra Padilla says:

    I also use RewriteCond %{HTTP_REFERER} on the htaccess, as suggested on this article https://moz.com/blog/how-to-stop-spam-bots-from-ruining-your-analytics-referral-data.

    # Block Russian Referrer Spam
    RewriteEngine on
    # Each condition matches one spam referrer; [NC] = case-insensitive, [OR] = chain with the next condition
    RewriteCond %{HTTP_REFERER} ^http://.*ilovevitaly\.com/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*ilovevitaly\.ru/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*ilovevitaly\.org/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*ilovevitaly\.info/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*iloveitaly\.ru/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*econom\.co/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*savetubevideo\.com/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*kambasoft\.com/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*buttons\-for\-website\.com/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*semalt\.com/ [NC,OR]
    RewriteCond %{HTTP_REFERER} ^http://.*darodar\.com/ [NC]
    # Return 403 Forbidden for any request whose referrer matched a condition above
    RewriteRule ^(.*)$ - [F,L]

    Filters can be a practical and faster alternative if you don’t have access or privileges to modify the htaccess. Plus, when modifying htaccess you need to be uber careful: one character out of place is enough to take the site down.

  13. @Thomas – good tip. I have modified the hostname filter for this.

    @Patrik – you make a good point – which is, these filters need to be monitored to ensure they work as intended for YOUR site (I do not have valid links on .io domains)

  14. I already found 160 referral spam domains and “compressed” them into 8 Google Analytics filters. To make sure everything works well, I’ve also made a self-service tool, so you don’t have to be a geek to have them implemented in a few clicks and seconds.
    Feel free to check my beta.

  15. Adding only your own valid hostnames to the hostname filter could cause you to lose data. For example, webcache.googleusercontent.com and translate.googleusercontent.com could also be valid hostnames. I would first look at the hostname report and check what else should be included in the hostname filter as well.

  16. Patrick says:

    Other than finding the .io being filtered out (lots of valid pages use .io if your business is in the tech branch) the second filter is very useful.

    I have an extra view set up using a custom report to show which hostnames are coming up with my analytics code and use that in a second “main” view to filter those false hosts out.

    There is a lot we can do with Google Analytics that is still hidden 🙂

© Brian Clifton 2018