Data Sampling in Google Analytics: How It Works, Why It Matters, and What to Do About It

By Nate Dame

You are a serious marketer, so, of course, you’re diligent about metrics and analytics. You create and run custom Google Analytics reports so you can make the best decisions for your team and your company, and so you can prove the value of your programs to the boss.

But just how accurate are the numbers you get from Google Analytics?

Custom reports often require a substantial amount of data. Instead of using all of the data available, Google Analytics may rely on data sampling. As a result, identical reports run by different users can return different numbers, and the results are estimates rather than exact counts.

To explain any possible discrepancies and continue making the best decisions, it’s important to understand Google Analytics’ data sampling.

Learn what data sampling means, how Google uses it, when it’s a problem, and how to get around it.

What Is Data Sampling?

Data sampling is a technique for forming insights on large amounts of data using a subset of the values available. It is often used when the entirety of the data is either inaccessible or too large to be considered in whole.

Exhibit A: At some point, we have all tried to guess how many pieces of candy were in the jar. There are two kinds of players for these games:

  • Player 1 takes a wild guess.
  • Player 2 counts the number of jellybeans at the bottom of the jar and the number in one vertical line, and then multiplies those two values.

Player 2 is more likely to guess the actual number of jellybeans because they are making an educated guess using available data. In essence, they're using data sampling.
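To make the idea concrete, here is a minimal Python sketch of sampling: it totals a random 10% of some made-up session data and scales the result up to estimate the full total. The data, the `estimate_total` helper, and the 10% rate are all illustrative assumptions; this is not how Google Analytics is implemented internally.

```python
import random


def estimate_total(values, sample_rate, seed=0):
    """Estimate sum(values) from a random sample, scaled up to the full size."""
    rng = random.Random(seed)
    sample_size = max(1, round(len(values) * sample_rate))
    sample = rng.sample(values, sample_size)
    # Scale the sampled total by (population size / sample size).
    return sum(sample) * len(values) / sample_size


# Hypothetical data: pageviews per session for 10,000 sessions.
data_rng = random.Random(42)
pageviews = [data_rng.randint(1, 10) for _ in range(10_000)]

actual = sum(pageviews)
estimate = estimate_total(pageviews, sample_rate=0.10)
print(f"actual: {actual}, estimate from a 10% sample: {estimate:.0f}")
```

Like Player 2's guess, the estimate should land close to the actual total without necessarily matching it.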

Analyzing all of the data for every report is often impractical. In many analytics programs, including Google Analytics, data sampling is used to limit how much data each query has to process and to speed up the delivery of the information, making the page load faster.

How Does Google Analytics Use Data Sampling?

Both standard Google Analytics and Analytics 360 (Google's premium platform) use data sampling, though the threshold for sampling is significantly higher in Analytics 360. According to Google, data sampling occurs in the standard Google Analytics application when a custom report results in more than 500,000 sessions at the view level for the specified date range.

For Analytics 360, the threshold is 100 million sessions. The premium platform comes with less sampling, more features, and more integrations—but at a steep, six-figure annual investment.

Default reports are not sampled, but custom reports can be if they draw on a significant amount of data. In custom reports, sampling generally occurs for one of four reasons:

  1. The timeframe returns more than 250,000 sessions. Even with data at about half of what Google specifies as the session threshold, sampling is commonly used.
  2. There’s a large amount of data with an advanced segment applied. More complex segments increase the likelihood that sampling will occur.
  3. There’s a large amount of data with a secondary dimension applied. With secondary dimensions, data becomes more complicated, increasing the chance that sampling will occur.
  4. You’re looking at a flow report. Reports like Users Flow, Behavior Flow, Events Flow, and Goal Flow have smaller session thresholds, allowing for a maximum of 100,000 sessions for the selected date range.
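The thresholds above can be expressed as a quick check. The sketch below is only a rough guide under stated assumptions: the numbers are the ones quoted in this article (not read from any Google API), the `may_be_sampled` helper and its parameter names are hypothetical, and it assumes the 100,000-session flow-report limit applies regardless of edition.

```python
# Session thresholds as quoted in this article, not pulled from any API.
SESSION_THRESHOLDS = {
    "standard": 500_000,
    "analytics_360": 100_000_000,
}
FLOW_REPORT_THRESHOLD = 100_000  # Users Flow, Behavior Flow, Events Flow, Goal Flow


def may_be_sampled(sessions, edition="standard", is_flow_report=False):
    """Return True if a report over `sessions` sessions exceeds the quoted threshold."""
    if is_flow_report:
        # Assumption: the flow-report limit is the same for both editions.
        threshold = FLOW_REPORT_THRESHOLD
    else:
        threshold = SESSION_THRESHOLDS[edition]
    return sessions > threshold


print(may_be_sampled(600_000))                           # standard custom report
print(may_be_sampled(600_000, edition="analytics_360"))  # same report in Analytics 360
print(may_be_sampled(150_000, is_flow_report=True))      # flow report
```

Keep in mind that segments and secondary dimensions can trigger sampling well below these limits, so a False result here is no guarantee of unsampled data.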

It’s easy to identify when Google is using sampled data. Above every report in Google Analytics, there is a line that says, “This report is based on X% of sessions.”

If the percentage of sessions is less than 100%, the data is being sampled.

How Does Data Sampling Impact the Accuracy of Google Analytics Data?

Going back to the jellybean example, Player 2 will be more likely to guess values closer to the actual number of jellybeans in the jar, but probably still won’t guess the exact value. Why? Jellybeans are oval, so they do not sit neatly on top of each other—the number in each row is likely inconsistent.

While counting and multiplying jellybeans results in an educated guess, in the end, it’s still just a guess—albeit one with a higher probability of accuracy. Is the same true for Google Analytics reports that use sampled data?

To test the accuracy of data sampling in Google Analytics in the past—when sampling seemed high for a month’s worth of reporting—I ran 30 one-day reports that sampled a lot less. Then, I compared the data in the monthly report with the sum of data from the one-day reports.

What I found is that sampling of 30% or higher is reliable, where you have samples with large buckets. On the other hand, if you wanted to know, for example, how many Linux users you have in San Mateo, it’s less accurate. Big buckets are still relatively accurate between 15 and 30%. It’s below 15% that the accuracy of the data becomes questionable. — Marissa Goldsmith, IQ Certified Google Analytics Expert

As long as the sampling percentage is 15% or greater, it's probably fine to treat the data as accurate enough to make good decisions. But if the report you need is based on less than 15% of sessions, it's time to consider a workaround.
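Those rules of thumb are easy to encode if you review many reports. This is a minimal sketch that simply restates the 30% and 15% cutoffs quoted above; the `sampling_reliability` function and its labels are made up for illustration, and the cutoffs are one practitioner's guideline rather than anything Google publishes.

```python
def sampling_reliability(sample_pct):
    """Map the 'based on X% of sessions' figure to the rule of thumb quoted above."""
    if not 0 < sample_pct <= 100:
        raise ValueError("sample_pct must be greater than 0 and at most 100")
    if sample_pct == 100:
        return "unsampled"
    if sample_pct >= 30:
        return "reliable for large buckets"
    if sample_pct >= 15:
        return "relatively accurate for large buckets"
    return "questionable: consider a workaround"


for pct in (100, 45, 20, 8):
    print(pct, "->", sampling_reliability(pct))
```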

Another red flag is when you notice that you’re looking at small numbers, and they are all the same.

Again, sampling on your big-bucket items, like traffic sources, is generally okay. When you’re looking at PDF downloads, and you notice that your bottom 20 PDFs all have 12 downloads—that’s sampling. — Marissa Goldsmith
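One plausible explanation for those identical small numbers, assuming sampled counts are scaled up by the inverse of the sampling rate: an item seen once in a sample of roughly 1 in 12 sessions is reported as 12, and so is every other item seen exactly once. The sketch below illustrates that arithmetic and a crude check for the pattern. The 1-in-12 rate, the `scale_up` and `looks_sampled` helpers, and the download figures are all hypothetical.

```python
def scale_up(sampled_hits, sample_rate):
    """Scale a count observed in the sample up to an estimate for all sessions."""
    return round(sampled_hits / sample_rate)


def looks_sampled(reported_counts, tail_size=20):
    """Flag a report whose smallest `tail_size` values are all identical."""
    tail = sorted(reported_counts)[:tail_size]
    return len(tail) >= 2 and len(set(tail)) == 1


# Hypothetical report: 20 rarely downloaded PDFs, each seen once in the sample,
# plus two popular PDFs seen 20 and 30 times.
sample_rate = 1 / 12
sampled_hits = [1] * 20 + [20, 30]
reported = [scale_up(hits, sample_rate) for hits in sampled_hits]

print(reported)
print(looks_sampled(reported))
```

Every rarely downloaded PDF ends up with the same reported value, which says more about the sampling rate than about the downloads themselves.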

Does Analytics Sampling Matter?

Should you be concerned by data sampling in Google Analytics? Possibly, but it depends on how you use the platform and how much traffic your properties get. If you only run default reports, you probably don’t need to be concerned. If traffic volumes are low, it’s unlikely that you’ll query a custom report that exceeds the session threshold.

But if you're dealing with a high-traffic site, and using Google Analytics to run detailed custom reports, it's important to pay attention to the percentage of data that's being used. Because of sampling, different users can get different numbers when running the same report, which can be difficult to explain to leadership.

While sampled data from Google Analytics is usually accurate enough for gathering insights and analyzing trends, it can cause issues when creating reports that will be double-checked.

Workarounds for Google Analytics Sampled Data

If Analytics is sampling and you need more accurate results, consider these workarounds:

  • Pull smaller data sets. Instead of large date ranges, run several one-day or one-week reports (depending on the quantity of data being accessed), and sum them to get more accurate data. Run the reports in Google Sheets using the Google Analytics Spreadsheet Add-On to query and report data from multiple views. The add-on uses the Google Analytics API, which has the lowest percentage of data sampling.
  • Filter buggy data in Google Sheets. Using the Google Analytics Spreadsheet Add-On, run huge reports with lots of dimensions. The plugin will still sample, but you can see the sampling rate, and adjust accordingly.
  • Pay for a third-party tool. There are several third-party tools that do the two processes mentioned above, but offer a better user interface. Some popular tools include Analytics Canvas and Tableau.

I definitely recommend something like Analytics Canvas, and, to a lesser extent, SuperMetrics (which is great not just for GA, but its use in other tools). Tableau can do this, but it's not its raison d'etre: it's a data crunching and visualization tool, and it will be a bit pricey. — Marissa Goldsmith
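The first workaround, running one-day reports and summing them, can be sketched in a few lines of Python. This is only an outline under stated assumptions: `fetch_sessions` is a hypothetical function you would write yourself (for example, on top of the Google Analytics API or a spreadsheet export) that returns the session count for a date range, and the stub used here just returns a fixed number.

```python
from datetime import date, timedelta


def daily_ranges(start, end):
    """Yield (day, day) pairs covering start..end inclusive, one per day."""
    day = start
    while day <= end:
        yield day, day
        day += timedelta(days=1)


def sum_daily_reports(start, end, fetch_sessions):
    """Run one small report per day and add up the results."""
    return sum(fetch_sessions(s, e) for s, e in daily_ranges(start, end))


def fake_fetch_sessions(start, end):
    """Stand-in for a real report query; always returns 100 sessions."""
    return 100


total = sum_daily_reports(date(2017, 4, 1), date(2017, 4, 30), fake_fetch_sessions)
print(total)
```

Summing works for additive metrics such as sessions or pageviews. It does not work for unique counts such as users, because the same visitor can appear in more than one daily report.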

Ultimately, these are all hacks. The most dependable way to cut down on sampling is to upgrade to Analytics 360, which raises the threshold to 100 million sessions. But if it's a problem that the occasional hack can handle, the workaround is usually the better choice when the alternative has such a heavy budget impact.

The Implications of Google Analytics Data Sampling

While it’s probably not an issue to measure trends and gather insights using sampled data, it’s important to be aware of data sampling. When running reports that will be reviewed, take note of the percentage of data being sampled. If it’s less than 15%, and if sampling is an issue for your company or your reporting goals, use a workaround to validate the data.

The workaround may be time-consuming, but it’s more practical than upgrading to Analytics 360 for most businesses. The time spent is worth it in the end when you can rest assured that you’re providing accurate and comprehensive reports—reports that can be explained in impressive detail if a client or leader comes to you later with questions about discrepancies.

via Technology & Innovation Articles on Business 2 Community http://ift.tt/2qAQv0t
