Noise or Music? - The Insights Blog

Jumpstart Regular Expression Tutorial for Google Analytics users

May 11, 2012 / Categories: Google Analytics specific, Metrics understanding / Comments: 11


If you manage a Google Analytics account, then understanding regular expressions – and how to set them up – is a key part of your job. This tutorial is intended to jump start novice users into the world of regular expressions – specifically from a Google Analytics point of view.

As you will see from reading my books, most regular expressions I use are pretty straightforward, so you shouldn’t be deterred from delving in and understanding them inside out. However, I have found over the years that many users are scared off the subject. My view is that this happens because the general study of regular expressions can rapidly become complex and overwhelming. The truth is that for the vast majority of GA work, you don’t need the full power or complexity that regular expressions (built for the IT industry) can provide. Hence, I created this jump start tutorial to help you focus on the parts you actually need.

A Quick Introduction…

Regular expressions, also referred to as regex, are a way for computer languages to match strings of text, such as specific characters, words, or patterns of characters. A simple everyday example of regular expressions is using wildcards for matching filenames on your computer. For example, *.pdf matches all filenames that end in .pdf. However, regex can be much more powerful than this. Within Google Analytics, regular expressions are primarily used when creating profile filters (see Chapter 8 of the book), advanced segments (also Chapter 8), and table filters (Chapter 4).

Understanding the Fundamentals of Regex

An important point to grasp when using regular expressions is that there are two types of characters: literals and metacharacters. Most characters are treated as literals. That is, if you wanted to match a URI containing advanced, you would type the literal character "a", followed by "d", followed by "v", and so forth (without quotes).

The exceptions to this are metacharacters. These are characters of special meaning to the regex engine and therefore interpreted differently. For example, the PDF example shown in the Introduction contains the metacharacter "*" (without quotes). The most common metacharacters are listed in Table 1. Ensure that you understand these before proceeding.

Table 1 – Common regular expression metacharacters

Metacharacter   Description
.               Matches any single character.
[ ]             Matches a single character that is contained within the square brackets. Referred to as a class.
[^ ]            Matches a single character that is not contained within the square brackets. Referred to as a class.
^               Matches the beginning of the string. This is referred to as an anchor.
$               Matches the end of the string. This is referred to as an anchor.
*               Matches zero or more of the previous item.
?               Matches zero or one of the previous item.
+               Matches one or more of the previous item.
|               The OR operator. Matches either the expression before or the expression after the operator.
\               The escape character. Allows you to use one of the metacharacters for your match.
( )             Groups characters into substrings.

NOTE: Google Analytics uses a partial implementation of the Perl Compatible Regular Expressions (PCRE) library. I use the word partial because a full implementation is more powerful and flexible than a software-as-a-service vendor would want it to be! For example, if its use is unrestricted, it can be used maliciously to hack or break a website. Therefore, not every feature of PCRE is included in Google Analytics…
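To see each metacharacter from Table 1 in action, here is a minimal sketch. It uses Python's re module purely as a convenient stand-in for the Google Analytics engine, and every sample string is invented for illustration. Each row pairs a pattern with one string it should match and one it should not.

```python
import re

# (pattern, string that should match, string that should not match)
# All sample strings are made up for illustration.
EXAMPLES = [
    (r"b.g",      "bug",        "bg"),             # .    any single character
    (r"gr[ae]y",  "grey",       "groy"),           # [ ]  one character from the class
    (r"gr[^ae]y", "groy",       "grey"),           # [^ ] one character NOT in the class
    (r"^/blog",   "/blog/post", "/my/blog"),       # ^    anchor to start of string
    (r"\.pdf$",   "guide.pdf",  "guide.pdf.zip"),  # $    anchor to end of string
    (r"lo*p",     "lp",         "lap"),            # *    zero or more of the previous item
    (r"colou?r",  "color",      "colouur"),        # ?    zero or one of the previous item
    (r"lo+p",     "loop",       "lp"),             # +    one or more of the previous item
    (r"cat|dog",  "hotdog",     "bird"),           # |    either side of the OR
    (r"a\.b",     "a.b",        "axb"),            # \    escape: match a literal period
    (r"(ab)+c",   "ababc",      "ac"),             # ( )  group, then repeat the group
]


def check(pattern, text):
    """Return True if the pattern matches anywhere in the text (a partial match)."""
    return re.search(pattern, text) is not None


for pattern, should_match, should_not_match in EXAMPLES:
    print(pattern, check(pattern, should_match), check(pattern, should_not_match))
```

Each printed line shows the pattern followed by True (for the matching string) and False (for the non-matching one).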

The best way to learn Regex is by example…

Using only literals, you can construct simple regular expressions. First, partial matches are allowed. For example, say you wanted to view only referrals from the website www.google.com. Using a regular expression, you could use the partial keyword "goog" in the table filter of your Traffic sources > Sources > All Traffic report. This will match all entries that have the letters "goog" in them, as shown in Figure 1.

Figure 1 – Table filter using a partial literal match


NOTE: The breakdown of geographic Google domains shown in Figure 1 is achieved by using the Custom SEO plugin for GA.

Being simple to use, literals can be very powerful—as long as you can identify a unique pattern match that includes the string of interest. Taking the previous example, to be more specific, use the OR metacharacter, as in this example:

google\.(com|co\.uk|ca)

This matches the literal google, followed by a period (this must be escaped because it is also a metacharacter), followed by com OR co.uk (period also escaped) OR ca. The result is shown in Figure 2.

Figure 2 – Table filter using the metacharacter OR


NOTE: Google Analytics automatically escapes periods in the report table filter and advanced segments for you. Therefore, you can omit the escape character (\) for these. However, when you are learning regex, I advise you to always escape these yourself as best practice. This is because profile filters, as well as goal or funnel configurations, do not have the automatic escape feature.
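The partial match and the OR example can be tried outside Google Analytics. The sketch below is only an approximation: it uses Python's re module, which does not escape periods for you, and the referrer list (including the deliberately odd googleXca.example) is hypothetical. It also shows why escaping the period yourself matters.

```python
import re

# Hypothetical referrer values; the last one exists only to expose the unescaped period.
SOURCES = [
    "google.com",
    "google.co.uk",
    "images.google.com",
    "bing.com",
    "googleXca.example",
]


def filter_sources(pattern, sources=SOURCES):
    """Keep the sources in which the pattern matches anywhere (partial match)."""
    return [s for s in sources if re.search(pattern, s)]


partial = filter_sources(r"goog")                       # literal, partial match
escaped = filter_sources(r"google\.(com|co\.uk|ca)")    # periods escaped
unescaped = filter_sources(r"google.(com|co.uk|ca)")    # periods left as "any character"

print("partial:  ", partial)
print("escaped:  ", escaped)
print("unescaped:", unescaped)
```

With the periods escaped, googleXca.example is excluded. With them unescaped, it slips through because the period matches the X.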

You will notice from Figure 2 that subdomains of Google are present in the reports. Suppose you wish to remove these from your matches. Modify the regex query as follows:

^google\.(com|co\.uk|ca)

This results in only referrers that start with the pattern google being matched. Another example to practice with includes:

^go.+le\.((com$)|(co\.uk)$|(ca)$)

This extends the previous example to explicitly match only Google domains that end in .com, .co.uk, and .ca. This removes referrers such as google.com.au, google.com.br, and so forth, as shown in Figure 3. Note that I have also been a little lazy and used go.+le to illustrate how to use the + metacharacter. That is, it is used to match one or more of the previous character—in this case, any character.

Figure 3 – Table filter using multiple metacharacters
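The effect of the two anchors can be checked with a short sketch. As before, Python's re module stands in for the Google Analytics engine, and the referrer list is hypothetical (goggle.com is included only to show what the lazy go.+le also lets through).

```python
import re

# Hypothetical referrers for illustration.
SOURCES = [
    "google.com",
    "google.co.uk",
    "google.ca",
    "google.com.au",
    "google.com.br",
    "images.google.com",
    "goggle.com",
]

START_ANCHORED = r"^google\.(com|co\.uk|ca)"
BOTH_ANCHORED = r"^go.+le\.((com$)|(co\.uk)$|(ca)$)"
# Same pattern with each $ moved inside its group.
BOTH_ANCHORED_ALT = r"^go.+le\.((com$)|(co\.uk$)|(ca$))"


def matching(pattern, sources=SOURCES):
    """Return the sources the pattern matches."""
    return [s for s in sources if re.search(pattern, s)]


print(matching(START_ANCHORED))
print(matching(BOTH_ANCHORED))
print(matching(BOTH_ANCHORED_ALT))
```

The start anchor alone removes images.google.com but keeps google.com.au and google.com.br. Adding the end anchors removes those too. Placing $ inside or outside each group gives the same result on this list, and the lazy go.+le also accepts goggle.com, which is the price of not spelling google out in full.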


The following are examples to consider when matching URLs listed in your Content / Top Content reports:

\?(id|pid)=[^&]*

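This matches a question mark followed by the first query parameter and its value, provided the parameter's name is id or pid. Because partial matches are allowed, any URI containing such a string is matched. If you have a report with URIs of the following form, this regex will match the first two URIs only (in the third, id is not the first parameter, so it is not preceded by ?):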

/blog/post?pid=101

/blog/post?id=101&lang=en&cat=hacks

/blog/post?lang=en&cat=hacks&id=102

/blog/about-this-blog

Typically, this regex format is used when defining a goal or funnel step. Note the use of the negative class to stop the regex match. That is, after id= or pid=, the regex matches every character up to, but not including, the next &. The asterisk (*) means zero or more of these non-& characters, so the match simply runs to the end of the URI when there is no second query parameter, as in the first URI.
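To see exactly which part of each URI the negative class captures, here is a small sketch run against the four URIs listed above. Python's re module is again a stand-in for the Google Analytics engine, and the helper name matched_part is my own.

```python
import re

PATTERN = re.compile(r"\?(id|pid)=[^&]*")

# The four URIs from the example above.
URIS = [
    "/blog/post?pid=101",
    "/blog/post?id=101&lang=en&cat=hacks",
    "/blog/post?lang=en&cat=hacks&id=102",
    "/blog/about-this-blog",
]


def matched_part(uri):
    """Return the substring the pattern matched, or None if there is no match."""
    m = PATTERN.search(uri)
    return m.group(0) if m else None


for uri in URIS:
    print(uri, "->", matched_part(uri))
```

The second URI shows the match stopping just before the first &. The third is not matched because its id parameter follows & rather than ?.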

An example that is useful when filtering within Keyword reports (search engines and internal site search) is to consider misspellings. Perhaps you need to find all matches for “colour” and “color.” The following regex will achieve this:

colo[u]*r

Here are some other misspelling examples:

Voda(ph|f)one

Ste(ph|v)en

Br[ai][ai]n

(My name is sometimes spelled Brain!)
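These misspelling patterns are easy to test for yourself. The sketch below uses Python's re module as a stand-in, with a made-up list of keywords. Note that Br[ai][ai]n is deliberately loose: each class accepts either letter, so it also lets through spellings such as Braan.

```python
import re

PATTERNS = {
    "colour": r"colo[u]*r",
    "vodafone": r"Voda(ph|f)one",
    "steven": r"Ste(ph|v)en",
    "brian": r"Br[ai][ai]n",
}

# Made-up keywords for illustration.
KEYWORDS = ["color", "colour", "Vodafone", "Vodaphone",
            "Steven", "Stephen", "Brian", "Brain", "Braan", "Bryan"]


def hits(name, keywords=KEYWORDS):
    """Return the keywords matched by the named pattern."""
    return [k for k in keywords if re.search(PATTERNS[name], k)]


for name in PATTERNS:
    print(name, "->", hits(name))
```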

Finally, although not directly relevant to Google Analytics, a common regex used in web development for processing forms is:

^(.+)@([^\(\);:,<>_]+\.[a-zA-Z.]{2,6})

Use this to test your understanding. Broken into its constituent parts, this regex checks an email address to ascertain if it is a valid format—that is, brian@mysite.com and not brian@@my_site:com, for example. From left to right, the English interpretation is as follows:

  • Match one or more of any character before the @
  • Match one or more characters after the @, but do not include any of the following characters: ( ) ; : , < > _
  • Followed by a period
  • Followed by between two and six characters, each of which must be an alphabetic character (A–Z in either upper- or lowercase) or a period

I have highlighted the middle section of this regex to help guide your eye, that is, the part between the @ and first period.
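If you want to check your reading of this regex, the following sketch runs it against the two addresses mentioned above. It uses Python's re module as a stand-in, and the helper name split_email is my own. The two parenthesised groups come back as the part before the @ and the part after it.

```python
import re

EMAIL = re.compile(r"^(.+)@([^\(\);:,<>_]+\.[a-zA-Z.]{2,6})")


def split_email(address):
    """Return (local part, domain) if the address matches, otherwise None."""
    m = EMAIL.search(address)
    return m.groups() if m else None


print(split_email("brian@mysite.com"))     # a valid format
print(split_email("brian@@my_site:com"))   # rejected: no period after the @ part
```

The second address fails because the characters after the @ never reach a period: the underscore and the colon are both excluded by the negative class.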

If you have followed these examples, you are well on your way to understanding regular expressions for use with Google Analytics. If not, reread this post and use one of the regex tools listed in Appendix B of the book. Further regex examples are shown throughout the book, though none are more complicated than those shown here.

Tips for Building Regular Expressions

  • Make the regular expression as simple as possible. Complex expressions take longer to process or match than simple expressions.
  • Avoid the use of .* if possible because this expression matches everything zero or more times and may slow processing of the expression. For instance, if you need to match all of the following: index.html, index.htm, index.php, index.aspx, index.py, index.cgi

use

index\.(h|p|a|c)+.+

not

index.*

  • Try to group patterns together when possible. For instance, if you wish to match a file suffix of .pdf, .doc, and .ppt

use

\.(pdf|doc|ppt)

not

\.pdf|\.doc|\.ppt

  • Be sure to escape the regular expression wildcards or metacharacters if you wish to match those literal characters. Common ones are periods in filenames and parentheses in text.
  • Use anchors whenever possible (^ and $, which match either the beginning or end of an expression), because these speed up processing.
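Two of the tips above can be illustrated with a short sketch, again using Python's re module as a stand-in and some hypothetical filenames. In index\.(h|p|a|c)+.+ the group requires the extension to start with h, p, a, or c, and the trailing .+ then matches the rest of the extension. The sketch also confirms that the grouped and ungrouped suffix patterns select the same files.

```python
import re

INDEX_FILES = ["index.html", "index.htm", "index.php",
               "index.aspx", "index.py", "index.cgi"]
OTHER_FILES = ["index.js", "index_old", "indexes"]   # hypothetical extras

SPECIFIC = r"index\.(h|p|a|c)+.+"
BROAD = r"index.*"

GROUPED = r"\.(pdf|doc|ppt)"
UNGROUPED = r"\.pdf|\.doc|\.ppt"
DOCS = ["report.pdf", "notes.doc", "slides.ppt", "photo.png"]   # hypothetical


def matching(pattern, names):
    """Return the names the pattern matches."""
    return [n for n in names if re.search(pattern, n)]


print(matching(SPECIFIC, INDEX_FILES + OTHER_FILES))
print(matching(BROAD, INDEX_FILES + OTHER_FILES))
print(matching(GROUPED, DOCS) == matching(UNGROUPED, DOCS))
```

The specific pattern keeps the six index files and nothing else, whereas index.* also picks up index.js, index_old, and indexes.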

Some useful regex tools to help you

I have used all of these, though I would love to hear about others:

Was this post useful…? Please let me know by adding a comment or sharing the ‘social love’ with a tweet, +1, Like etc…


Comments

  1. JB says:

    Great post. Thanks for breaking it down with examples helpful for GA!

  2. Jesper says:

    Great post! I learned a lot! However, I don’t really understand your example where index\.(h|p|a|c)+.+ is supposed to match index.html, index.htm, index.php, index.aspx, index.py and index.cgi. Could you explain?

    Thanks!
    Jesper

  3. For anyone who is scratching their head about how to define a very simple wildcard as a goal in Google Analytics here is how you do it (it’s not clear from the instructions).
    e.g. say your site had a number of special offers pages, and you wanted to count a page view of any page or sub directory that included the word

    offer

    as a ‘View Of An Offer Page’ Goal…

    When setting up the goal, you simply select ‘regular expression match’ from the drop down box, and type in

    offer

    in the Goal URL without any extra characters or symbols at all.

  4. Thanks for the tutorial. It is very handy in analyzing and understanding Google Analytics data.

  5. Tom says:

    Regex is fast becoming an essential skill for any SEO/web analyst. I’m not technical but learning regex has really improved my reporting capabilities! Have just ordered your new book – keep up the good work Brian :)

  6. Paul Brown says:

    Hello Brian, I’m certainly going to buy your book but I wondered if you also offer any online training? Thanks! Paul

  7. Jesper says:

    Very nice explanation. I hate regular expressions, but this post will definitely help when I can’t avoid them :)

    Thanks a lot,
    Jesper

  8. Steven says:

    Brian, thank you for this wonderful, step-by-step introduction.
    Just to make sure I am understanding the basics correctly, here are some follow-up questions. Your examples are in front:

    ^go.+le\.((com$)|(co\.uk)$|(ca)$) > is there a difference to: ^go.+le\.((com$)|(co\.uk$)|(ca$))

    colo[u]*r > would this be even more concise: colo[u]?r

    Br[ai][ai]n > did you mean: Br(ai|ia)n

    If you could maybe address the “English translation” of these examples as well as the action of each individual building block.

    Thanks

  9. @Frank: are you referring to the Best Add-ons post? If so, I use all the browser tools – just not the last two desktop apps (11 and 12), as these are for Windows and I am now Mac convert :)

  10. Frank says:

    Excellent post and information. I’ve read a few articles on the topic but I like how you’ve broken down the details here, I’m sure i’ll refer to it from time to time.

    Of the tools you listed in that other post which do you use/recommend the most?


© Brian Clifton 2015