Geo-Testing Series Part 3: Analytics & Reporting
Hello again, Growth Marketers. I’m Andrew Covato, and I’ve spent my career building and deploying incrementality-based measurement and optimization systems for major ad buyers and sellers. I now work with performance marketers, helping them build out scientific growth programs. I also work with measurement platforms, but exclusively with ones that are doing smart things–namely, that are helping scale incrementality measurement–just like Measured. Check out a bit more about me at growthbyscience.com.
This is the conclusion of my three-part blog series on geo testing. Previously, I wrote about test setup and test management. Today, I address our most complex topic: analysis. This topic can get very nuanced, so I will do my best to keep it at the highest level. Your main takeaway here should be that it takes a lot of additional work to marginally improve the quality of a test analysis, but as your organization grows, those marginal gains become absolutely crucial to understanding your marketing performance. For example, a startup running $500k of media spend a year only needs to know (and likely only can know) directionally if their investment is ROI-positive, whereas a multinational organization with a $1B performance budget must understand their incrementality to a very tight degree of precision. Remember: even just 1% of $1B = $10M…so a wide margin of error would generate results that swing by huge sums of money. So let’s dive in…as before, we’ll break down analysis into smaller components to discuss individually:
- Calculating incrementality
- Validating results
- Reporting
As a quick refresher, incrementality refers to the lift in KPI that can statistically be shown to have been caused by marketing efforts. Incrementality is also referred to as "causal impact" or "lift," but they all mean the same thing: how much of KPI X resulted from my ads vs. how much would I have gotten anyway? Inherently, to calculate incrementality, you must take a difference between "something that had ad exposure" and "something that did not have ad exposure" (or "something that statistically represents a lack of ad exposure").

The simplest way to accomplish this is with an old-school matched market test. With this method, you would use geo characteristics (like population, age distribution, average income, etc.) to find "similar" geographies. You would then treat some geographies with ads and not treat others (as we described in Part 1 [ADD LINK]). You could then just take the difference of the observed KPI in those two market groups as a proxy for incrementality. The rigor with which markets are "matched" can be high…or very low/non-existent (e.g., geos are simply randomized). This approach is fraught with uncertainty and relies on a lot of shaky assumptions, and I would personally not recommend it.

A more robust way to estimate lift is "difference-in-differences" (DID) analysis. A great deep dive into DID and the various ways to implement it can be found here. To use DID for an ads geo test, you would need to find two markets that track closely in terms of their historic trends on the KPI you're interested in measuring, and follow some of the assumptions in the link above. Typically, you will create a regression model for the KPI (with terms corresponding to treated/not, time, and other relevant characteristics). The regression coefficient on the interaction between time and treatment is an estimator of the effect of the ads. Yup…it can get complex quickly.
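To make the DID mechanics concrete, here is a minimal sketch in Python using simulated data and plain least squares. The market KPI values, lift size, and noise levels are all invented for illustration; a real analysis would use a library like statsmodels and carefully check the parallel-trends assumption first.

```python
import numpy as np

# Hedged sketch of a difference-in-differences (DID) estimate via OLS.
# All data below is simulated; the "true_lift" is what we hope to recover.
rng = np.random.default_rng(42)

n_periods = 60        # daily KPI observations per market group
true_lift = 50.0      # simulated incremental KPI per day from ads

# Both market groups share the same underlying trend (the key DID assumption).
base = 1000 + 2.0 * np.arange(n_periods)
control = base + rng.normal(0, 10, n_periods)
test = base + 30 + rng.normal(0, 10, n_periods)  # a level offset is fine for DID
post = np.arange(n_periods) >= 30                # campaign starts at period 30
test[post] += true_lift                          # inject the ad effect

# Long-format regression: KPI ~ intercept + treated + post + treated:post
kpi = np.concatenate([control, test])
treated = np.concatenate([np.zeros(n_periods), np.ones(n_periods)])
post_col = np.concatenate([post, post]).astype(float)
X = np.column_stack([np.ones(2 * n_periods), treated, post_col, treated * post_col])

coef, *_ = np.linalg.lstsq(X, kpi, rcond=None)
did_estimate = coef[3]   # the interaction coefficient is the DID lift estimate
print(f"estimated lift: {did_estimate:.1f} (true: {true_lift})")
```

Note how the pre-existing level difference between the two market groups does not bias the estimate: DID differences it away, which is exactly why it beats a raw test-vs-control comparison.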
The absolute best way to perform a geo test (and the one that has become the industry standard, with open-source implementations from Meta and Google) is to employ a predicted counterfactual; this is exactly how Measured runs tests. In this method, you designate some geographies as test markets and model out synthetic untreated "versions" of the test markets based on actual observed data from control markets. You then compare the real observed behavior in the test markets to the modeled versions. To supercharge this methodology, you would need to optimize which markets to test into, which terms to include in the model, and the calculation of confidence intervals. All of these tasks are best accomplished with some type of iterative simulation. Tools like Measured automate this highly complex process.
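A bare-bones sketch of the predicted-counterfactual idea, again on simulated data. This is not Measured's implementation; production tools layer on market selection, regularization, and simulation-based confidence intervals, but the core mechanic is fitting the test market as a function of control markets on pre-campaign data and projecting that fit forward.

```python
import numpy as np

# Hedged sketch of a predicted counterfactual (synthetic control).
# All market series are simulated for illustration only.
rng = np.random.default_rng(7)

n_pre, n_post = 90, 30                    # pre-period for fitting, post-period for the test
t = np.arange(n_pre + n_post)
national = 500 + 1.5 * t + 20 * np.sin(t / 7)   # shared trend + seasonality

# Three control markets and one test market, each tracking the shared pattern.
controls = np.stack(
    [national * w + rng.normal(0, 5, t.size) for w in (0.8, 1.0, 1.2)], axis=1
)
test = national * 0.9 + rng.normal(0, 5, t.size)
test[n_pre:] += 40                        # simulated incremental lift during the campaign

# Fit the test market as a linear combination of control markets on pre-period data...
X_pre = np.column_stack([np.ones(n_pre), controls[:n_pre]])
coef, *_ = np.linalg.lstsq(X_pre, test[:n_pre], rcond=None)

# ...then project the untreated "synthetic" test market into the campaign period.
X_post = np.column_stack([np.ones(n_post), controls[n_pre:]])
counterfactual = X_post @ coef
lift_per_period = (test[n_pre:] - counterfactual).mean()
print(f"estimated lift per period: {lift_per_period:.1f} (true: 40)")
```

The appeal of this design is that the control markets never need to "match" the test market on demographics; they only need to predict it well during the pre-period.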
To ensure that you can trust experimental results, you need some level of validation that the experimental setup was sound. It's virtually impossible to do that with any degree of rigor if you're just comparing random test/control markets directly. If you're using DID, you can perform a type of resampling to get a distribution of your estimator, from which you can calculate a p-value (which, loosely speaking, is the probability that the results you are seeing are due to pure chance). A predicted counterfactual, however, offers more flexibility. For example, you can perform retroactive "A/A" tests, in which you model out a synthetic control for a period in which your test markets were actually untreated. This lets you check how accurate the models are and how likely they are to report a "fake lift" where none exists. You can also use mean absolute percentage error (MAPE) as a gauge of how closely your synthetic control tracks the test markets. These validations are native to Measured and implemented in an easy-to-use fashion, and confidence intervals (the range of results you'd expect to see 95% of the time) are presented clearly. Note: when your measured lift falls outside the confidence interval, the results are termed "statistically significant" (i.e., there is a low probability they occurred by chance).
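As a rough illustration of the MAPE check on an A/A-style period (all numbers simulated): during a window with no ads, the synthetic control should track the test market closely and the apparent "lift" should hover around zero.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error between observed and modeled KPI."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Simulated A/A check: an untreated test-market KPI series and the
# synthetic control's prediction of that same untreated period.
rng = np.random.default_rng(0)
observed = 1000 + rng.normal(0, 10, 60)        # real (untreated) test-market KPI
synthetic = observed + rng.normal(0, 12, 60)   # model prediction, with its own error

print(f"A/A MAPE: {mape(observed, synthetic):.2f}%")
print(f"A/A fake lift: {(observed - synthetic).mean():.1f}")
```

A low A/A MAPE and a near-zero fake lift are what give you license to believe the lift you later measure during a real campaign.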
Running ad hoc tests inherently means you must re-analyze the test and generate a report each time. Ironically, the more rudimentary your test methodology (e.g., DID or a simple delta analysis), the more custom validation you must do to gauge how trustworthy your results will be. I am a huge fan of standardized, templatized reporting for tests, as this allows an organization to build a repository of results to continually learn from.

TL;DR
- There are lots of ways to run a geo test: matched markets, difference-in-differences, or predicted counterfactual. The latter is the industry standard these days and is implemented natively by Measured.
- No matter what methodology you use, validation and calculation of statistical significance are key.
- Always seek to implement a system or use a tool that provides standardized test results and analysis.
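As an illustration of the standardized reporting recommended above, here is a minimal sketch of a fixed report record. The fields are my own suggestion, not a Measured schema; the point is simply that every test lands in the same shape, so results stay comparable across a growing repository.

```python
from dataclasses import dataclass, asdict

# Hedged sketch of a standardized geo-test report record.
# Field names and example values are invented for illustration.
@dataclass
class GeoTestReport:
    test_name: str
    channel: str
    lift_estimate: float   # incremental KPI attributed to ads
    ci_low: float          # 95% confidence interval bounds
    ci_high: float
    aa_mape_pct: float     # A/A validation error for the synthetic control

    @property
    def significant(self) -> bool:
        # Significant when the confidence interval excludes zero lift.
        return self.ci_low > 0 or self.ci_high < 0

report = GeoTestReport("Q3 prospecting holdout", "paid_social",
                       42_000.0, 11_000.0, 73_000.0, 2.1)
print(asdict(report), "significant:", report.significant)
```

Even a lightweight record like this, filled in the same way for every test, turns one-off experiments into an institutional knowledge base.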
That wraps up our geo-testing series. Be sure to also check out parts one and two on test design and test management. Geo-testing is a powerful tool for revealing the true impact of your media investments, but not all tests are created equal. Download our ebook, 'The 3 Essential Steps to Geo-Testing Like a Pro' to learn everything you need to know about geo-testing today.