What Is A Holdout Test in Marketing?
Unlocking the Potential of Holdout Tests in Marketing Strategies
Intro to Holdout Experimentation: A Marketer's Tool
In a world where marketing departments are increasingly asked to do more with less, no brand can afford to spend precious budget on channels that are not driving a true causal lift to sales.
We know that all media vendors claim to drive an outsized number of sales, a portion of which would “likely have happened anyway,” even without that media channel being active. The question is, how many? How can we quantify the true incremental contribution of a media channel to the business?
This is the basis of “incrementality” as a marketing concept. Holdout experimentation (or holdout testing for short) is the primary weapon in the incrementality arsenal and allows brands to reveal a true picture of marketing performance.
In essence, a holdout test answers the question: “How many sales would I lose if I removed this media from my portfolio?” This “counterfactual” statement is the academic definition of causality as it applies to marketing and is the basis for all holdout testing.
For a full rundown of incrementality as a concept and capability, read our FAQ: What is incrementality in marketing?
The Essentials of Setting Up a Holdout Test
A holdout test removes (or “holds out”) a media channel from a selected audience and observes that audience's conversion behavior versus a control audience that continues receiving the media as usual.
*Note: the exposed group is referred to as the “control” in this case because it represents the typical business-as-usual scenario of the media being active.
One key feature of incrementality testing (as opposed to on-platform lift studies or brand lift studies) is that results are based solely on a brand's first-party transaction data.
This is critical because the core priority in experimentation is to measure a media channel’s net impact on a brand's sales, including all the interactive effects the media may have on other media channels, organic channels, etc.
Selecting a Tactic to Test
The first and most obvious step is to select the channel or tactic that you’d like to measure. This has implications for subsequent steps, as different channels are best measured with slightly different techniques (more on that later).
Generally speaking, the best place to start holdout testing is with the tactic receiving the highest spend in your portfolio. This is where informed actions resulting from the holdout experiment are likely to have the biggest business impact, simply due to the portion of the budget being affected.
Another consideration is the plausible range of performance outcomes for a given channel. For example, there is almost no scenario in which an email program is unprofitable, regardless of its incremental contribution, simply because it is so cost-effective to run. As a result, email is usually not a priority for testing.
On the other hand, social retargeting campaigns often take up a significant portion of the marketing budget and can perform wildly differently depending on specific targeting details (audience) and campaign execution details (like triggers, timing, frequency, etc.). This makes social retargeting a common priority for experimentation.
For certain platforms with automated targeting algorithms (e.g., Google Performance Max or Meta Advantage+ Shopping), special considerations need to be taken into account with channel/tactic selection. For a detailed overview of this topic, see here.
Duration and Time of Year
The next key consideration is the time of year and duration of your holdout experiment.
Brands instinctively know that media performs differently throughout the year, depending on factors like seasonality and promotional events. This is why it’s important to carefully select time periods that are representative of the business at large, or to test the same tactic at multiple points during the year to quantify how these factors impact incrementality.
Additionally, the duration of an experiment should be considered prior to launching. Generally speaking, 30 days is a good standard to realize the full near-term causal impact of a marketing tactic, but for some brands with longer consideration periods, this is not enough.
A good rule of thumb is to run your experiment for at least the length of your average consideration period, giving sales time to reflect the delayed impact of removing media from an audience.
Minimum duration can also be partially informed by Statistical Significance exercises (more on that later).
Establishing Control and Test Groups
There are several methods for “splitting” an audience into representative Test and Control groups:
A known-audience split takes a representative group of individual users from an existing user list and withholds a given media tactic they would otherwise receive. This is only possible for media channels where user-based targeting is available, typically CRM-based channels such as email, catalog, and SMS.
When designing a Known-Audience split, the main factors to control for are recency, frequency, and monetary value of recent purchases and a user's eligibility (e.g., opt-in vs. opt-out) to receive the media in question or any other related media.
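As an illustration, a known-audience split stratified on RFM buckets might be sketched as follows. This is a minimal sketch, not a production randomizer; the field names (`recency_bucket`, `frequency_bucket`, `monetary_bucket`) and the `opted_in` eligibility flag are hypothetical.

```python
import random
from collections import defaultdict

def stratified_holdout_split(users, holdout_frac=0.1, seed=42):
    """Split a known audience into test (holdout) and control groups,
    stratified on recency/frequency/monetary (RFM) buckets so both
    groups are representative of recent purchase behavior.
    `users` is a list of dicts with hypothetical keys:
    id, recency_bucket, frequency_bucket, monetary_bucket, opted_in."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in users:
        if not u["opted_in"]:  # only users eligible for the media can be held out
            continue
        key = (u["recency_bucket"], u["frequency_bucket"], u["monetary_bucket"])
        strata[key].append(u["id"])
    test, control = [], []
    for ids in strata.values():
        rng.shuffle(ids)
        n_hold = round(len(ids) * holdout_frac)  # hold out a fraction of each stratum
        test.extend(ids[:n_hold])
        control.extend(ids[n_hold:])
    return test, control
```

Splitting within each stratum (rather than across the whole list at once) keeps the RFM mix of the holdout group close to that of the control group even at modest sample sizes.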
A geo-split is used when an audience is unaddressable, meaning that individual users cannot be targeted directly from a pre-existing list, and therefore, a known-audience split isn’t feasible. This applies to any channel that employs broad targeting like social prospecting, CTV, paid search, and most retargeting tactics.
The geo-split method identifies specific markets within a broader region (e.g., states within a country) that are statistically representative of that broader region and groups them together. The media in question is then removed from these test markets, and conversion behavior is compared to a group of control markets (business-as-usual).
When designing a geo-split, the main factors to account for and control for are a market’s sales trend and seasonality, population conversion rate (i.e., market penetration), and media relevancy (historical execution of the media in question being representative of the broader region).
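To make the market-selection idea concrete, here is a minimal Python sketch that scores each candidate market by how closely its sales trend tracks the rest of the region. The data is hypothetical, and a real design would also match on conversion rate, seasonality, and historical media execution, typically with purpose-built tooling.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_candidate_markets(market_sales, min_corr=0.8):
    """market_sales: dict mapping market name -> weekly sales series
    (hypothetical data). Score each market by correlation of its trend
    with the aggregate of all OTHER markets; markets above `min_corr`
    are candidates for the test (holdout) group, best first."""
    scores = {}
    for m, series in market_sales.items():
        rest = [sum(vals) for vals in
                zip(*(s for k, s in market_sales.items() if k != m))]
        scores[m] = pearson(series, rest)
    return sorted((m for m, c in scores.items() if c >= min_corr),
                  key=lambda m: scores[m], reverse=True)
```

A market whose trend diverges from the region (for example, one dominated by a local promotion) scores low and is excluded, since it would make the control group a poor predictor of the test group.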
Holdout Tests vs A/B Testing: When Do You Use Each?
Marketers often confuse holdout testing with A/B testing, but these are not the same thing. Holdout tests remove a media to evaluate its “absolute” contribution to the business, while A/B tests serve two different treatments to similar audiences to see which one drives better performance.
What are the Advantages of Holdout Tests Over A/B Testing?
The key advantage of holdout testing is that it measures the absolute incremental contribution of a media to the business. In other words, it tells you if a media provides a profitable return to the business overall, informing whether it should be further funded or not.
A/B testing does not provide such insight; instead, it measures the relative performance of two different treatments, all else being equal. This is useful for optimizing messaging and creative within a channel but will not tell you how that channel is performing as a whole.
Scenarios Favoring A/B Testing
As mentioned above, A/B testing is best used to optimize campaign performance within a given marketing channel. This generally falls into two categories:
Creative Optimization: Serving two different creatives to likewise audiences simultaneously, with the same campaign settings, to observe which creative is more compelling (i.e., which creative drives the higher conversion rate).
Channel Optimization: Varying a certain campaign setting between two likewise audiences while holding all else equal, for example:
- Bid Types
- Optimization Events
Such “head-to-head” testing informs how to maximize performance or efficiency within a given channel but doesn’t inform total budget allocation between and across channels, which is the end goal for holdout testing.
Analyzing Results: How Do I Interpret Data from Holdout Tests?
There are two main methods for calculating the results of a holdout test:
1. Pre-Post Analysis
Also known as “difference-in-differences” (among other names), a pre-post analysis selects a reference period before the test and then calculates the “difference” in sales between the experiment period and the reference period for both test and control markets.
These differences are then compared to see if the test (holdout) group “lost sales” compared to the control group.
While this works in theory, week-to-week or month-to-month noise and volatility within markets make it an unstable technique that can yield unreliable results.
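The arithmetic behind a pre-post comparison is straightforward, which is part of its appeal. A minimal sketch with hypothetical sales figures:

```python
def diff_in_differences(test_pre, test_exp, control_pre, control_exp):
    """Difference-in-differences estimate of the sales lost when media
    was held out. Each argument is total sales for the test (holdout)
    or control group during the pre (reference) or experiment period."""
    test_delta = test_exp - test_pre           # change in holdout markets
    control_delta = control_exp - control_pre  # change in business-as-usual markets
    return test_delta - control_delta          # negative => media was incremental

# Hypothetical figures: control grew by 5,000 while test grew by only 2,000
effect = diff_in_differences(test_pre=100_000, test_exp=102_000,
                             control_pre=150_000, control_exp=155_000)
# effect = -3_000: the holdout markets "lost" roughly 3,000 in sales
```

Note that comparing raw deltas across groups of different sizes is fragile (percentage changes are often used instead), and either way the noise problem described above remains.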
2. Baseline Estimation
In causal inference, the factual outcome is the outcome observed under a treatment, and the counterfactual is the outcome that would have been observed absent the treatment.
Since the counterfactual is unobserved, we use statistical machine learning to estimate it. The counterfactual prediction model estimates the expected sales in the test markets had media not been removed – essentially baseline sales. The baseline is then compared to the observed sales when the media was removed to arrive at the estimated effect of the media.
There are various statistical techniques that can be used to create the baseline, but the general idea is to employ a regression model that leverages a few years of historical sales data in the control markets to “predict” likely sales in the test markets for any given time period.
The more representative test markets are of the broader region, the more accurate this prediction will be, hence the critical importance of market selection.
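A deliberately simplified sketch of the baseline approach, using a single control series and ordinary least squares: the model is fit on historical data, then fed control-market sales from the experiment period to predict counterfactual test-market sales. Real implementations use multiple control markets, seasonality features, and far richer models.

```python
def fit_baseline(control_hist, test_hist):
    """Fit a simple OLS regression of test-market sales on
    control-market sales over a pre-experiment history window.
    Returns (intercept, slope)."""
    n = len(control_hist)
    mx = sum(control_hist) / n
    my = sum(test_hist) / n
    sxx = sum((x - mx) ** 2 for x in control_hist)
    sxy = sum((x - mx) * (y - my) for x, y in zip(control_hist, test_hist))
    slope = sxy / sxx
    return my - slope * mx, slope

def predict_baseline(control_exp, intercept, slope):
    """Predict counterfactual test-market sales for the experiment
    period from observed control-market sales in the same period."""
    return [intercept + slope * x for x in control_exp]
```

The estimated effect of the media is then the observed test-market sales minus these predicted baseline values, summed over the experiment period.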
Understanding Statistical Significance
When an experiment is completed, the counterfactual prediction model uses the sales of the control markets during the experiment period to estimate the likely baseline sales in the test markets for the same period had media not been removed.
We then attempt to infer that the difference between observed sales during the experiment period and the baseline sales is the incremental contribution of that media to the business, as no other variables were altered during the experiment. This inference is then established with statistical significance.
A hypothesis test on the incremental contribution is employed to check its statistical significance. Essentially, we want to know the chances of the estimated incremental contributions occurring randomly, i.e., without the media intervention/treatment.
When the probability of random occurrence is low relative to our risk tolerance for making an error in decision-making, we conclude that the results are statistically significant and can attribute the incremental contribution to the media with high confidence.
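As an illustration of the mechanics, a z-test on the daily observed-minus-baseline differences might look like this. This is a normal approximation that is only reasonable for longer experiments; real measurement tools use more careful inference (e.g., accounting for autocorrelation and model uncertainty in the baseline).

```python
import math

def lift_z_test(observed, baseline):
    """Test whether daily observed sales in the holdout markets differ
    from the counterfactual baseline. Applies a two-sided z-test to
    the daily differences (normal approximation). Returns
    (mean_daily_lift, p_value)."""
    diffs = [o - b for o, b in zip(observed, baseline)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                              # standard error of the mean
    z = mean / se
    p = math.erfc(abs(z) / math.sqrt(2))                 # two-sided p-value
    return mean, p
```

A small p-value means it is unlikely the observed shortfall in the holdout markets arose by chance, supporting the conclusion that the held-out media was driving incremental sales.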
Making Data-Driven Decisions
Ultimately, if a test doesn’t directly inform an investment or optimization decision, it amounts merely to an expensive science project.
Test results must be applied to an attribution framework (systematic, ongoing attribution of sales to media, enabling ROAS(i) calculation and allocation optimization) in order to be valuable.
Check out this FAQ on optimizing ad spend for more on how to leverage test results to drive a broader incrementality capability.
Integrating Holdout Tests into Your Marketing Toolbox
Holdout testing has become a must-have capability for brands looking to prove the value of their marketing budget to the business and make the most out of limited resources.
For more information on how Measured enables this capability for hundreds of brands, speak to one of our measurement experts today.