Note (Dec 3, 2020): These instructions are for Google's Universal Analytics only. They do not apply to Google Analytics 4, which does not currently provide any way for you to filter results. I expect Google to add this functionality in the near future.
Whether you look at your data in Google Analytics once a month, week or day, there's a good chance you've come to trust it as an accurate source of activity from your website or app. Heck, why wouldn't you? It's Google. They must know what they're doing.
The truth, though, is that while Google Analytics is a great product, the data is rarely 100% clean and can easily lead you to an incorrect analysis.
The good news is that there are ways to normalize and cleanse the data to give yourself and your team a more accurate picture of results. This can be done in several ways: fixing the issue at the source, cleaning the data as it's processed using filters, or correcting it after the fact with segments and BI tools.
To help guide you towards a cleaner view, here are eight steps to improve the quality of your data in Google Analytics.
1. Secure your Site
One often misunderstood trouble spot for Google is how it treats clicks from a secured site (HTTPS) to an unsecured site (HTTP). If your site is still not secure (HTTP rather than HTTPS) and you have links from a growing number of sites that are, there’s a good chance an increasing percentage of your referral traffic will be stripped and masked as Direct Traffic. The quick solution is to secure your site to ensure any inbound links from HTTPS pages go to HTTPS pages on your own site.
Outside of encrypting any data that you're collecting via forms, you'll also help ensure Google Analytics treats the traffic appropriately.
2. Tag your Sources
This one is simple and it's all on you. If you're advertising across multiple channels or running email campaigns, it's important to add tracking parameters to your campaign URLs. If you miss this one step, there's a good chance your paid campaigns will be reported as Referral, Direct or (Other) traffic, as shown here.
A closer look at (Other) shows that there are three campaigns with no UTM Medium tag set and another (i.e. boosted) that Google can't interpret.
The end result is a huge bucket of traffic that doesn’t sit in the right location.
To fix the issue, you can either advise your marketing agency to apply the correct UTM parameters (using Google’s Campaign URL Builder) at the source or adjust the Default Channel Settings to capture the misplaced campaigns as the data is being processed.
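If you'd rather tag URLs in bulk than paste them into the builder one at a time, the same output can be produced with a few lines of Python. This is a minimal sketch using only the standard library; the domain and campaign values are hypothetical placeholders.

```python
from urllib.parse import urlencode, urlparse, urlunparse

def tag_url(url, source, medium, campaign, content=None, term=None):
    """Append utm_* query parameters to a landing-page URL."""
    params = {
        "utm_source": source,      # e.g. facebook, newsletter
        "utm_medium": medium,      # e.g. cpc, email, social
        "utm_campaign": campaign,  # e.g. summer-sale
    }
    if content:
        params["utm_content"] = content
    if term:
        params["utm_term"] = term
    parts = urlparse(url)
    query = parts.query + ("&" if parts.query else "") + urlencode(params)
    return urlunparse(parts._replace(query=query))

print(tag_url("https://example.com/red-dresses", "facebook", "cpc", "summer-sale"))
# https://example.com/red-dresses?utm_source=facebook&utm_medium=cpc&utm_campaign=summer-sale
```

The parameter names (utm_source, utm_medium, utm_campaign) are the ones Google Analytics actually reads; everything else here is illustrative.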
As I like to treat Paid Social as a separate channel from organic Social, I prefer to create a new Channel Definition like the one below (see View > Channel Settings > Channel Grouping in Admin) and set the Source / Medium accordingly.
The key to remember here is that any changes to the current Default Channel Groupings will permanently change how new traffic is classified in Google Analytics. Therefore, it’s important to verify that it’s working. Also, keep in mind that historical data is not affected. If you wish to filter historical data, you’ll need to use a BI tool such as Google Data Studio, Microsoft PowerBI or Tableau to move it into the right basket.
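If you do need to re-bucket historical data in a BI tool, the classification rule itself is only a few lines. Here's a minimal Python sketch of a hypothetical rule in the spirit of the custom channel definition above; the source and medium lists are my assumptions, not GA defaults.

```python
# Hypothetical channel rule mirroring the custom definition described
# above: social sources with a paid medium become "Paid Social".
SOCIAL_SOURCES = {"facebook", "instagram", "linkedin", "twitter", "pinterest"}
PAID_MEDIUMS = {"cpc", "ppc", "paid", "boosted"}

def channel(source, medium):
    if source in SOCIAL_SOURCES:
        return "Paid Social" if medium in PAID_MEDIUMS else "Social"
    return "Other"

print(channel("facebook", "cpc"))     # Paid Social
print(channel("facebook", "social"))  # Social
```

Applying the same function across an exported date range gives you the re-bucketed history that GA's own channel changes won't touch.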
Like misplaced social campaigns, the same is true for email campaigns. If you haven’t turned on tagging in your email application, it will end up misrepresented in your overall story as either Direct or Referral traffic.
Once again, the simple fix is to turn on tracking in your email platform or add the UTM tracking parameters to your campaign URLs so that Google understands where the traffic came from and can attribute it to the correct source.
3. Update Organic Search
Though Google dominates organic search in the US and many other countries, it is by no means the only player in town. There are a number of other existing and new entrants in the space that tend to get missed by Google Analytics. In fact, it's not uncommon to find search engines like Yahoo, Ask and DuckDuckGo sitting in your Referral basket.
Thankfully, Google has provided an easy way to add obscure search engines (e.g. Search Encrypt) and not-so-secure ones (e.g. Yahoo) to the Organic Search basket.
In Admin go to Property > Tracking Info > Organic Search Sources and add each of the search engines from your referral list as shown here.
Once you’ve updated the list of Organic Search Sources, you should start to see this traffic sorted accordingly. I recommend checking the Referral list over the coming week to ensure Google is getting it right.
In addition, it’s important to verify that the traffic from these search engines is organic and not a result of ads with missing tracking parameters.
4. Normalize your Data on the Fly
A common issue with Google Analytics is that you may see separate entries for what are effectively the same campaign name, content or page URL. The reason is that Google Analytics is case sensitive and captures each string as unique and distinct from the others. In other words, if you've used campaign names that mix lowercase and uppercase (such as red dresses, Red Dresses and Red dresses), Google Analytics will treat them as separate campaigns rather than one in your reports.
As a best practice, account for casing by using filters to normalize data during collection.
For example, to fix the red dresses scenario described above, it's as easy as creating a filter to normalize your campaign names into a single value, as shown in this example.
To be safe, the same fix could also be applied to Campaign Source, Medium, Term and Content.
The same is true for your page path URLs. If the URLs on your site use a mix of upper and lowercase characters it’s possible to see /ThankYou.html and /Thankyou.html as two different pages. To combine the two into one, create another filter as shown here.
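To see why the filter matters, here's what the same cleanup looks like applied after the fact to exported rows: a rough Python equivalent of GA's Lowercase filter. The field names and session counts are made up for illustration.

```python
from collections import Counter

# Made-up export rows; GA captured the same campaign and page with
# different casing, so they appear as separate entries.
rows = [
    {"campaign": "Red Dresses", "page": "/ThankYou.html", "sessions": 40},
    {"campaign": "red dresses", "page": "/thankyou.html", "sessions": 25},
]

# Lowercasing the dimensions before aggregating collapses the
# duplicate entries into one.
combined = Counter()
for row in rows:
    key = (row["campaign"].lower(), row["page"].lower())
    combined[key] += row["sessions"]

print(dict(combined))  # {('red dresses', '/thankyou.html'): 65}
```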
Outside of campaign tags and page paths it’s also helpful to combine the different variations of a source such as Facebook into a single source. A quick look at your Social channel will show that most social networks send traffic from a number of different URLs. For example, it’s not uncommon to see traffic from Facebook under the following domains:
A quick way to combine the different variations into a single source is to create a new filter that finds and replaces each of the domains under a single source using the following search string:
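As an illustration of the idea (the exact search string for your data depends on which Facebook domains actually show up in your own Referral report), here's the same consolidation expressed in Python:

```python
import re

# Illustrative pattern only; build the real domain list from your report.
facebook_pattern = re.compile(r"^(?:m|l|lm|web)\.facebook\.com$|^facebook\.com$")

sources = ["facebook.com", "m.facebook.com", "l.facebook.com", "t.co"]
normalized = ["facebook.com" if facebook_pattern.match(s) else s for s in sources]
print(normalized)  # ['facebook.com', 'facebook.com', 'facebook.com', 't.co']
```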
Though these quick fixes can be quite helpful for standard normalization, they likely won't cover all scenarios, especially if your site collects the value of a user entry. In that scenario, your users could enter the same name in multiple ways, leaving you with multiple variations of what is potentially the same entry. The end result makes analysis more difficult and forces you to cleanse the data in other ways.
Outside of the Campaign Name, Page Path and Facebook filters covered above, filters can be applied to other types of scenarios such as excluding your IP address and removing query parameters appended to URLs.
Here are some of my go-to filters:
- Lowercase Campaign Name, Source, Medium, Content & Page Paths
- Combine all Facebook, Linkedin, Twitter, Instagram & Pinterest sources
- Remove all Query Parameters from Page Paths
- Exclude Internal Traffic from IP Address
- Include your Hostname (see section 5)
- Exclude Referral Spam & Crawlers (see section 6)
For a full list with examples, check out Taavi Kalvi’s 30+ Google Analytics Filters.
If wrangling data in Google Tag Manager is more your style, Yasen Lilov provides an elegant way of dealing with mixed cases and null data in GTM as well. Normalizing null, false or undefined values to (not set) or N/A is important, as Google Analytics will otherwise ignore hits with undefined values.
By implementing filters and applying format values in Google Tag Manager, you'll end up saving yourself heaps of time and frustration when you need to analyze the data.
5. Exclude Internal Traffic
This may sound like a no-brainer, but if you work for a large company with employees who regularly access the company website, there's a good chance their actions are diluting your data. To remove this traffic, add a filter that excludes your company's IP address(es) along with those of any third-party agencies or developers who browse the site for their work.
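If you want to sanity-check which hits such a filter would catch, the logic is easy to reproduce in Python with the standard library's ipaddress module. The office and agency ranges below are placeholders; substitute your real addresses.

```python
from ipaddress import ip_address, ip_network

# Placeholder internal ranges; replace with your company's and your
# agency's actual IP addresses or CIDR blocks.
INTERNAL_NETWORKS = [ip_network("203.0.113.0/24"), ip_network("198.51.100.17/32")]

def is_internal(ip):
    """Return True if the visitor IP falls inside any internal range."""
    addr = ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)

print(is_internal("203.0.113.45"))  # True
print(is_internal("8.8.8.8"))       # False
```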
6. Block the Bots & Spiders
If you spend any time reviewing the source traffic of your site, you've probably come across weird-looking traffic sources that leave you scratching your head. With a plethora of opportunities to scrape content, search for known vulnerabilities and conduct a host of other activities, there are many known bots and spiders roaming the web to take what they find back to their owner's hive. While many of these bots and spiders are fairly benign, they end up inflating your data and messing with your sense of cleanliness.
To start the cleansing process, the first step is to enable Bot Filtering in the Admin area of Google Analytics. Under View Settings, click the checkbox under Bot Filtering to exclude all hits from known bots and spiders.
Sadly, you’re not done yet. There’s still more to do starting with ghost spam.
Exorcise your Ghosts
If you really want to ditch the dirty data, you need to block spam and other bots. One of the most common types of spam is from sources with fake hostnames, also called ghost spam.
Thankfully there’s a quick fix to cleanse your data of these annoying intruders. To do this, add a filter that only includes your hostname to stop all ghost spam at the gate.
If you have more than one hostname, just separate them with a pipe character like example1.com|example2.com|example3.com to provide the cure.
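You can test the pipe-separated pattern before pasting it into the filter; here's a quick Python check using the placeholder hostnames from above:

```python
import re

# The same pipe-separated pattern you'd paste into an "Include only
# hostname" filter; example1.com etc. are placeholders.
valid_hostname = re.compile(r"^(example1\.com|example2\.com|example3\.com)$")

hits = ["example1.com", "ghost-spam.xyz", "example3.com"]
kept = [h for h in hits if valid_hostname.match(h)]
print(kept)  # ['example1.com', 'example3.com']
```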
ISP Organization Networks
Interestingly, one common area of bot-related traffic arrives directly from ISP organizations like Google, Alibaba, Microsoft, Facebook and others. To remove these ISP organizations, you need to create a filter with the following expression:
Thankfully, there are lists of known bots and examples of how to keep them from infecting your data. The example above comes from Carlos Escalera, who has put together The Ultimate Guide to Stopping Google Analytics Spam, Bots and Other Junk Traffic. Though I've included some of the key areas in this section, Carlos provides further instruction, along with the scripts to implement the required filters, in much more detail.
It's important to note that while the expression provided by Carlos blocks most bots, it may also block some legitimate traffic, especially from companies like Microsoft and Google. Therefore, it's important to test the expression on your traffic and verify whether the traffic should be cleansed (for example, if you see sessions with a 100% Bounce Rate, low Average Session Duration and one page per session, you're likely dealing with bots).
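That verification step can be sketched in a few lines. The thresholds below are judgment calls of mine, not values defined by Google Analytics, and the sample rows are made up:

```python
# Made-up rows; the thresholds are illustrative heuristics, not
# GA-defined values.
sources = [
    {"source": "alibaba.com", "bounce_rate": 1.00, "avg_duration": 0,  "pages_per_session": 1.0},
    {"source": "google",      "bounce_rate": 0.42, "avg_duration": 95, "pages_per_session": 3.1},
]

def looks_like_bot(s):
    """Flag sources whose engagement looks non-human."""
    return (s["bounce_rate"] >= 0.99
            and s["avg_duration"] == 0
            and s["pages_per_session"] <= 1.0)

suspects = [s["source"] for s in sources if looks_like_bot(s)]
print(suspects)  # ['alibaba.com']
```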
Watch your Language
Another way bots can affect your data is by disguising their language and showing up under an out-of-place language setting. One problematic source is traffic that arrives with the language set to C, which doesn't correspond to any actual language.
The example below shows just over a hundred New Users hitting the site with low Average Session Duration and no Transactions.
And if you're really lucky, you might even receive some very special messages disguised as languages, as shown here. Though it looks like there may be a few real users in the mix, the poor quality makes it worth filtering out.
To remove the C language and special messages from your data, Carlos has once again created a simple filter to block the data. Just create a Custom Exclude filter with the following pattern:
With the new filter applied, language C and other troublesome messages will be out of sight.
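For readers who want to prototype the idea before touching their GA view, here's an illustrative check in Python. This is not Carlos's exact expression, just a pattern of my own that flags the C locale and "languages" that are really spam messages:

```python
import re

# Assumption: real language codes like "en-us" are short and contain no
# spaces or punctuation, so flag the "C" locale plus anything with
# spaces, dots, commas or 15+ characters.
spam_language = re.compile(r"^c$|\s|.{15,}|[.,]")

languages = ["en-us", "c", "secret discount inside visit now", "fr"]
clean = [lang for lang in languages if not spam_language.search(lang)]
print(clean)  # ['en-us', 'fr']
```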
Eliminate Fake Referral Spam
Well before fake news entered the political discourse, we had fake referrals, also known as referral spam. Referral spam is a fake URL that registers as a source in Google Analytics in the hope that you might visit the URL and learn what the spammer has to offer.
While many large sites may not care, as the annoyance is washed out by the sheer volume of traffic, there's a good chance it will have a larger proportional impact on smaller sites, ultimately inflating your traffic volume metrics, skewing your sales and reducing the accuracy of your engagement metrics. And if you're a purist and want only the cleanest view, it's worth the effort to clean it up.
A simple way to spot this type of traffic is to look for referrals with 100% Bounce Rate, a session duration of zero minutes & zero seconds, zero Transactions (or goals) and zero Revenue. Depending on the amount of traffic you receive, it should be fairly easy to spot as shown in this example.
Alternatively, you can also create a custom report using the Source, Screen Resolution and Hostname dimensions combined with Sessions, % New Sessions, Avg. Session Duration, Bounce Rate and Transactions metrics to verify with additional data points.
Clearly there's no one behind these sessions.
To clean it up, the first step is to stop the fake referrals from infecting your data. You can do this by setting up a new filter in the Admin area of Google Analytics (Admin > View > Filters).
The second step is to create a segment that removes all of the nefarious sources from your historical data. While you could create your own, I highly recommend using Carlos's nifty Clean Segment Template for Google Analytics, which filters out future referral spam as well as hostname, language, network domain and service provider spam. The final segment might look similar to the following:
Once the bots and crawlers are gone, you might just end up with a lower Bounce Rate, higher Pages Per Session, higher Per Session Value and a higher Conversion Rate. You’ll look like a rock star without spending an extra dollar.
When we apply the new segment to the data, we start to get a cleaner picture of what's really happening. In the example below, you can see how excluding 1,001 sessions of bot-related activity improves the numbers. Outside of the bounce rate being slightly higher, all other metrics, including Pages/Session, Average Session Duration, Ecommerce Conversion Rate and Per Session Value, are in fact better. The only unchanged number is Transactions, which is what we'd expect to see if we were only throwing out the crap.
Of course, it's possible to have a false positive in the bunch, so make sure you take a look at the data without the filter applied from time to time.
The end result is a much cleaner view of your actual business. What’s not to like?
7. Remove Anomalies
While referral spam and bots can mess with your numbers, you may also come across anomalies in your own data that skew any comparisons.
For instance, let’s say you decide to run a campaign that offers an inexpensive or free item with little or no transactional value. To receive it, customers have to create an account, add the item to their cart and complete an order. While this strategy can help build up your database, it also creates a large number of transactions or conversions with little or no revenue. The end result is more conversions, a higher Conversion Rate, lower Average Order Value and lower Per Session Value for the period of the campaign.
Though this data is important, especially when you’re analyzing the success of your campaigns, it may skew your view when you’re comparing month over month or year over year numbers. In this case it’s best to remove these outliers so you end up comparing apples with apples.
Consider this scenario below in which the retailer only allowed visitors to download the catalog if they created an account and hence a transaction. Though there was no cost for the catalog, the process helped to build the retailer’s email database. However, when you look at the previous numbers compared to present day, the current performance looks like a mixed bag.
In fact, with 49% fewer transactions and a 47% lower conversion rate, it looks like there’s trouble on the horizon.
Of course when you drill deeper you can easily see the source for the difference.
Though it’s important to understand the context of the data, it’s also helpful to remove anomalies that distort the reality of current results.
By creating a new segment we’re able to remove all transactions with a zero dollar sum so we end up with a much clearer picture on how the business is trending today compared to the time when the campaign ran.
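The segment's effect is easy to reproduce on exported data: drop the zero-dollar transactions before computing period metrics. The numbers below are made up for illustration.

```python
# Made-up numbers: two real orders plus two zero-dollar catalog "orders".
transactions = [
    {"id": "T1", "revenue": 120.00},
    {"id": "T2", "revenue": 0.00},   # free-catalog signup
    {"id": "T3", "revenue": 75.50},
    {"id": "T4", "revenue": 0.00},   # free-catalog signup
]
sessions = 1000

# Keep only transactions with real revenue before computing metrics.
paid = [t for t in transactions if t["revenue"] > 0]
conversion_rate = len(paid) / sessions
avg_order_value = sum(t["revenue"] for t in paid) / len(paid)
print(f"{conversion_rate:.2%}", avg_order_value)  # 0.20% 97.75
```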
With cleaner data, you not only have a better benchmark for tracking trends, but you should also have more confidence that the numbers you’re sharing with management are accurate and more likely to hold up for any future testing.
Note: There is a common refrain in data analysis that you shouldn’t remove anomalies which I generally agree with. However, there are cases such as comparing period over period trends where it is not only helpful but critical to ensure you’re comparing apples with apples (just make sure you’re transparent about the data).
8. Remove Duplicate Transactions
One of the more annoying aspects of Google Analytics is that once a session or action is recorded, there's no way to physically remove it. This can become problematic if you allow your customers to revisit the order confirmation or receipt page after their original session, thereby triggering the event again and causing Google to record another transaction, as you can see in this example:
Tip: You can create the same report in Custom Reports by joining the Transaction ID dimension with the Transactions and Revenue metrics.
Though Google Analytics will filter out duplicates within a single session, the same is not true once your customer returns and hits the same page.
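To measure how much duplicates are inflating your numbers, you can group exported hits by Transaction ID and keep only the first occurrence of each. A minimal sketch with made-up data:

```python
# Made-up hits: the customer revisited the receipt page, so T1001
# appears twice in the export.
hits = [
    {"transaction_id": "T1001", "revenue": 49.99},
    {"transaction_id": "T1002", "revenue": 19.99},
    {"transaction_id": "T1001", "revenue": 49.99},  # repeat pageview
]

first_seen = {}
inflated = 0.0
for h in hits:
    if h["transaction_id"] in first_seen:
        inflated += h["revenue"]  # a repeat hit double-counts revenue
    else:
        first_seen[h["transaction_id"]] = h

print(round(inflated, 2))  # 49.99 of phantom revenue
print(len(first_seen))     # 2 real transactions
```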
Like the bots and spam issues covered earlier, it’s best to eliminate the duplicates in order to maintain the accuracy, consistency and analysis of your data.
Ideally, you want to prevent customers from hitting the order confirmation page more than once without placing a second order, prevent them from refreshing the page and make sure all testing is conducted on a different server with a different UA ID.
Unlike the other items, this fix takes a bit more technical muscle. Thankfully, Simo Ahava has put together a great step-by-step post on how to remove duplicate transactions using customTask in Google Tag Manager.
With a bit of elbow grease and some help from Simo, you’ll be on your way to cleaner transactions and a more accurate analysis in no time.
References
- Why Google Analytics Data is Wrong — Robin Rozhon
- Google’s Campaign URL Builder — Google
- 30+ Google Analytics Filters — Taavi Kalvi
- Data Normalization Made Easy With GTM — Yasen Lilov
- The Ultimate Guide to Stopping Google Analytics Spam, Bots and Other Junk Traffic — Carlos Escalera
- Clean Segment Template for Google Analytics — Carlos Escalera
- Prevent Duplicate Transactions In Google Analytics With customTask — Simo Ahava
I’m the founder of Data Collective, a boutique agency focused on helping businesses turn disparate data sources into insights and actions that drive better outcomes. Please give this article a clap if you got some value from it.