I've trained a lot of marketers on experimentation over the years, and the same thing always surprises them: the hard part is almost never the statistics. The marketing experimentation foundations that separate teams who learn from teams who just run tests are mostly about discipline: a clear hypothesis, one honest metric, and the judgment to know when an experiment is even the right tool. The math is the easy part.
This is the primer I wish more teams started with. It's the foundation underneath everything else I write about, including how I now layer AI into the process in my AI experimentation framework. Get these fundamentals right and the fancier techniques compound. Skip them, and no tool will save you.
Why experimentation beats opinion
The honest case for experimentation is that our hunches are wrong more often than we'd like to admit. A test does two things at once: it gives you statistical rigor, and it lets you understand the impact of a change before you commit real resources to scaling it. That second part is underrated. Experiments are how you find the right strategy and actually know you've found it, rather than betting the quarter on a confident guess.
There's also a quieter organizational benefit. A culture of testing keeps the loudest or highest-paid voice in the room from automatically winning, because the data gets a vote. The line I still come back to captures it better than I can:
If we have data, let's look at data. If all we have are opinions, let's go with mine.
— Jim Barksdale, former CEO of Netscape
It's worth staying humble about the hit rate, too. The research on online experiments, most thoroughly collected in Ronny Kohavi's book Trustworthy Online Controlled Experiments, is sobering: even at the most sophisticated companies, only a minority of experiment ideas produce a clear win. No matter how obvious an idea seems or how many competitors are doing it, it can simply fail, and many do. That's not a reason to test less. It's the entire reason to test at all.
Correlation isn't causation, so don't just ship and watch
The most common shortcut is to launch a change, watch the metric, and call the movement a result. The problem is you can't separate your change from everything else happening that week. Imagine launching a new offer in a single market during a major holiday and watching transactions fall. Did the offer fail, or was everyone just busy celebrating? Without a control group, you genuinely can't tell, and you'll draw the wrong lesson with full confidence.
This is why controlled experiments matter. Randomly assigning users to different experiences is what lets you attribute a change in behavior to the thing you changed, rather than to timing, seasonality, or a dozen other moving parts.
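One common way to get stable random assignment without any platform is to hash a user identifier together with an experiment name. The sketch below is illustrative only: the function name `assign_variant`, the experiment key, and the user ID format are all hypothetical, and a real setup would also need to log which variant each user actually saw.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Map a user to a variant by hashing the experiment name and user ID.

    The same user always lands in the same variant for a given experiment,
    and a different experiment name reshuffles the assignment.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Hypothetical usage: assign 10,000 made-up users and count the split.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "holiday-offer")] += 1
```

Because the assignment depends only on the hash, it is unrelated to timing, market, or anything else about the user, which is the property that lets you attribute a difference to the change itself.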
A/B testing is the gold standard, when you can run it
A randomized A/B test is the cleanest way to establish causality, and you do not need an expensive platform to start. The principle is randomization, not the software. A sales team can test two pitches across a hundred calls and log the results in a spreadsheet, and that is a legitimate experiment. The tooling makes it scale; it doesn't make it valid.
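To make the spreadsheet version concrete, here is a minimal sketch of a two-proportion z-test, one standard way to compare two conversion rates. The call counts and close counts are invented for illustration, the function name is hypothetical, and the normal approximation it relies on is rough at this sample size, so treat the output as a sanity check rather than a verdict.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (difference in rates, z statistic, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both pitches convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("No variation in outcomes; the test is undefined.")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

# Hypothetical log: 100 calls split evenly between two pitches.
diff, z, p_value = two_proportion_z_test(conv_a=10, n_a=50, conv_b=18, n_b=50)
```

With these made-up numbers, pitch B closes 36% of calls against 20% for pitch A, yet the p-value lands around 0.07, short of the conventional 0.05 threshold. A hundred calls is a legitimate experiment, but it can only detect large differences.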
The catch is that a true A/B test isn't always feasible. There are several situations where randomization breaks down, and recognizing them early saves you from running a test you can't trust:
- Not enough volume to randomize meaningfully, which is common in low-traffic or high-touch sales motions.
- Reputational or ethical concerns, where it would be unfair to give some people a materially different experience.
- You don't control who gets which experience, as with most organic search.
- You can't tie outcomes back to the randomization, which happens with some paid-media setups.
- Interference between groups, where people who know each other land in different variants and influence each other's behavior.
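For the volume question in particular, a rough sample-size estimate before launch can tell you whether a test is worth running. The sketch below uses a textbook approximation for comparing two proportions; the baseline rate, the lift, the function name, and the defaults of 5% significance and 80% power are all assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(baseline: float, lift_abs: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect an absolute lift."""
    p1 = baseline
    p2 = baseline + lift_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical: 5% baseline conversion, hoping to detect a move to 6%.
needed = sample_size_per_group(baseline=0.05, lift_abs=0.01)
```

Under these assumptions the estimate comes out at roughly eight thousand users per variant. If your channel sees a few hundred prospects a quarter, that is the signal to reach for a different method.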
What to do when you can't A/B test
Not being able to randomize doesn't mean you're flying blind. It means you reach for a weaker but still useful method, and you're honest about its limits.
- Pre/post comparison: measure the metric before and after the change, but only trust it if nothing else changed at the same time, and remember a small-audience change may not show up sitewide.
- Response from affected users: track just the users a change touched. This works only if those users weren't already active; otherwise you still need a control to make any real claim.
- Geographic holdouts: run different experiences in comparable markets and measure the difference-in-differences, with a pre-check that the markets really are comparable.
- Research and external data: user research, surveys, and pattern libraries like GoodUI can support a hypothesis or surface problems before you ever build anything.
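The difference-in-differences calculation behind a geographic holdout is simple arithmetic. Here is a minimal sketch with invented weekly transaction counts for one treated market and one comparison market; the function name and all the numbers are hypothetical, and the sketch assumes the two markets would have moved in parallel without the change.

```python
from statistics import mean

def difference_in_differences(treated_pre, treated_post,
                              control_pre, control_post) -> float:
    """Change in the treated market minus change in the control market."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical weekly transactions around a holiday launch.
treated_pre = [1000, 1020, 980]    # treated market, before the offer
treated_post = [900, 910, 890]     # treated market, after the offer
control_pre = [1000, 990, 1010]    # comparison market, same weeks
control_post = [850, 860, 840]     # comparison market, no offer

effect = difference_in_differences(
    treated_pre, treated_post, control_pre, control_post
)
```

In this made-up example, transactions fell by 100 a week in the treated market but by 150 in the comparison market, so the estimated effect of the offer is plus 50. A naive pre/post read would have called the offer a failure.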
The foundations of marketing experimentation start with the hypothesis
If there's one place to spend your discipline, it's here. A good hypothesis can be written as "if we do X, then Y will happen," it can be clearly supported or refuted by the test, and it has a clear causal link between the change and the expected impact. Crucially, it commits to one success metric with a measurement plan defined in advance, before you've seen any data and can talk yourself into a different number.
Underneath that, three questions need real answers. What exactly are you changing? State it granularly enough to isolate the effect but meaningfully enough to expect a result, because a marginal copy tweak rarely moves anything readable. What is the expected impact? State it specifically and measurably, and make sure it is big enough to be worth your time. And who will be affected? Check that they are the right population and that the group is large enough to be interesting.
How I prioritize what to test
You will always have more ideas than capacity, so prioritization is where a lot of programs quietly succeed or fail. I weigh three factors, which will feel familiar to anyone who has used ICE or RICE: upside, confidence, and effort. Upside is the real prize, and it's worth breaking it into reach (how many people the experiment touches) and impact (how much it moves the variable you care about), because the total upside on the growth model is the product of the two.
- Upside: reach multiplied by expected impact, measured against the growth model rather than a vanity number.
- Confidence: how likely the test is to succeed, based on evidence and prior results, not enthusiasm.
- Effort: the marketing, product, and data resources it will actually take to ship and measure it.
The point of scoring isn't false precision. It's to force a consistent conversation so the biggest, most winnable bets rise to the top instead of whatever someone happened to feel strongly about that morning.
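A scoring pass like this fits in a few lines. The sketch below computes upside as reach times impact, as described above, and then combines it with confidence and effort in a RICE-style ratio. That combining formula is an assumption on my part for illustration, as are the class name, the units, and every idea and number in the backlog.

```python
from dataclasses import dataclass

@dataclass
class ExperimentIdea:
    name: str
    reach: int         # people the experiment touches
    impact: float      # expected relative lift on the target metric
    confidence: float  # 0 to 1, based on evidence and prior results
    effort: float      # person-weeks to ship and measure

    @property
    def upside(self) -> float:
        return self.reach * self.impact

    @property
    def score(self) -> float:
        # Assumed RICE-style combination: upside, discounted by confidence,
        # per unit of effort.
        return self.upside * self.confidence / self.effort

# Hypothetical backlog.
ideas = [
    ExperimentIdea("Pricing page redesign", 20_000, 0.05, 0.5, 4),
    ExperimentIdea("Button color tweak", 50_000, 0.001, 0.3, 0.5),
    ExperimentIdea("Onboarding email sequence", 8_000, 0.10, 0.6, 2),
]
ranked = sorted(ideas, key=lambda idea: idea.score, reverse=True)
```

The exact weights matter less than applying the same ones to every idea. Here the button tweak reaches the most people but ranks last, because its expected impact is too small to add up to meaningful upside.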
The fundamentals are what compound
A few habits separate a real experimentation practice from a pile of one-off tests. Build experiments around themes that connect to a growth vision rather than scattering them. Accept that small, consistent wins are what keep marketing programs healthy over time. Treat culture as at least as important as tooling, because the best platform in the world won't help a team that doesn't ask good questions. And document and share results widely, wins and losses alike, so the organization builds a compounding narrative of what it has learned.
That last habit is the one I'd protect most. An experiment that teaches you something true, even when it loses, is worth more than a win you can't explain. These foundations are also exactly what make AI useful in the loop rather than dangerous, which is the subject of how I run experiments with AI, and they sit underneath the broader way I think about building marketing systems. If you want more of how I approach this kind of work, that's most of what I write about on my homepage, Sebastian Behar.
