An AI Marketing Experimentation Framework That Actually Ships

By Sebastian Behar · 7 min read

Most of the AI experimentation advice I read is written by people who don't actually run experiments. It tends to be all velocity and no rigor: generate a hundred variants, ship them, and let the model sort it out. That's not a program, it's gambling with extra steps.

I've spent the last decade running experimentation as a discipline across high-growth companies, where it helped drive major revenue expansion, and in earlier growth roles before that. What follows is my actual AI marketing experimentation framework: where AI genuinely compresses the lifecycle, where it quietly wrecks it, and the rules I use to keep it honest. This is the process I run now, not a generic guide.

Why an AI marketing experimentation framework needs guardrails first

Over time I've come to treat AI as a force multiplier on whatever discipline you already have. If your experimentation is sloppy, AI makes it sloppy faster. If it's rigorous, AI lets you be rigorous at a scale you couldn't staff before. The growth-as-a-system thinking that writers like Elena Verna have popularized applies cleanly here: AI is leverage on the system, not a replacement for it.

The failure mode I see everywhere is teams using AI to skip the thinking. They prompt for "10 high-converting headline variants" and ship them with no hypothesis behind any of them. Even if a variant wins, you've learned nothing transferable, because you can't explain why, which means you can't reuse it.

Experimentation isn't really a variant-generation problem; it's a learning problem. The best programs treat the whole thing as a learning system rather than a variant factory: every test feeds a compounding body of knowledge, and each hypothesis ladders up to a single North Star metric so you can tell which wins actually move the business. The output of a good test isn't a winner; it's durable knowledge about your audience. AI is dangerous precisely because it makes the cheap part, producing variants, nearly free, which tempts you to underinvest in the expensive part: knowing what you're trying to learn. So I built the framework around protecting that part. AI accelerates every stage but never skips the hypothesis. It's the experimentation-specific version of the systems mindset I describe in building marketing systems with AI tools.

Where AI actually speeds up the lifecycle

There are three stages where AI earns its place, in roughly ascending order of leverage:

  • Ideation against a hypothesis. With a sharp hypothesis, say "enterprise buyers don't convert because the page leads with features, not outcomes," I use Claude to expand the variant space. Not "give me headlines," but "given this hypothesis and audience, what five distinct framings would each test it differently?" The constraint is what makes the output useful.
  • Variant generation at scale. For lifecycle work in Braze, or landing-page tests in Optimizely and Mutiny, AI drafts copy and structure quickly. What used to take a half-day now takes about an hour, including my edits.
  • First-pass analysis. I pipe results from Amplitude and Looker into Claude to summarize segment cuts, flag where lift concentrates, and draft the readout. It works as a fast second analyst that surfaces things I'd miss on a tired Friday afternoon.

What ties these together is that AI accelerates execution inside a structure I define. It never sets direction; the hypothesis, the metric, and the call on what "good" means all stay with me. The tooling matters too, and I break down the full stack in my marketing tech stack.

Where AI actively hurts if you let it

Each of the speed-ups above has a shadow version that quietly degrades your program, so I treat these as known hazards rather than surprises.

Regression to the generic. AI copy converges on the safe middle, and left unchecked, your variants all start to sound like the same competent stranger. The whole point of a test is to probe a real difference, so I deliberately push variants apart, because two near-identical options just waste traffic.
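
One cheap way to catch near-identical variants before they consume traffic is a word-overlap check. The sketch below is only an illustration, not the process described above: the function names, the sample headlines, and the 0.6 threshold are all assumptions, and lexical overlap is a crude proxy, since two variants can share most of their words and still test different framings.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercased word set for a piece of copy."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two variants have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are trivially identical
    return len(ta & tb) / len(ta | tb)


def too_similar_pairs(variants: list[str], threshold: float = 0.6):
    """Return variant pairs whose word overlap meets or exceeds the threshold.

    The threshold is an arbitrary starting point, not a recommended value.
    """
    flagged = []
    for a, b in combinations(variants, 2):
        score = jaccard(a, b)
        if score >= threshold:
            flagged.append((a, b, round(score, 2)))
    return flagged


# Hypothetical headlines, invented for this example.
variants = [
    "Ship features faster with our platform",
    "Ship features faster with the platform",
    "Cut onboarding time for every new hire",
]
for a, b, score in too_similar_pairs(variants):
    print(f"Too close ({score}): {a!r} vs {b!r}")
```

A flagged pair is a prompt for a human look, not an automatic rejection; the judgment about whether two variants probe a real difference still belongs to whoever owns the hypothesis.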

Confident, wrong claims. A model will happily write "the #1 platform" or invent a statistic on the spot. In a regulated environment, where real compliance constraints apply, an unverified claim in a live variant is a legal problem, not a copy problem. Every claim gets sourced before it ships.

Analysis that flatters the hypothesis. This is the subtle one. Ask AI "did my variant win?" and it will find a way to say yes, because if you slice the data enough ways, something is always significant. AI makes p-hacking effortless and polite, and that's the failure I guard against hardest.
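
The arithmetic behind "something is always significant" is easy to check. As a minimal sketch, assume each segment cut is an independent test at a 0.05 threshold and that no real effect exists; the chance that at least one cut looks like a win is then 1 - (1 - alpha)^k for k cuts. Real segment cuts overlap and are not independent, so treat the numbers as a rough guide. The function names are illustrative, and the Bonferroni correction shown is one common, conservative remedy rather than something the framework above prescribes.

```python
def familywise_false_positive_rate(num_cuts: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent null cuts."""
    return 1.0 - (1.0 - alpha) ** num_cuts


def bonferroni_threshold(num_cuts: int, alpha: float = 0.05) -> float:
    """Per-cut p-value threshold that keeps the overall rate near alpha."""
    return alpha / num_cuts


for cuts in (1, 5, 20):
    rate = familywise_false_positive_rate(cuts)
    print(
        f"{cuts:>2} cuts: {rate:.0%} chance of a spurious 'win'; "
        f"Bonferroni threshold p < {bonferroni_threshold(cuts):.4f}"
    )
```

Under these assumptions, twenty cuts give roughly a 64% chance that at least one looks significant by luck alone.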

AI makes the cheap part of experimentation free and the dangerous part effortless. The framework exists to protect the expensive part: knowing what you're trying to learn.

Sebastian Behar

How I read AI-assisted analysis without fooling myself

The analysis stage is where good programs tend to go to die, so I run it with hard rules. The core discipline is pre-registration: I decide what would prove the hypothesis before I see a single result.

  1. Pre-register the primary metric and minimum effect size, written down before launch and not negotiated after.
  2. Make AI argue the null: "Make the strongest case that this result is noise." If the win survives its own best counterargument, I believe it more.
  3. Treat segment cuts as hypothesis-generating, never confirming. An unpredicted lift is the next test, not this test's conclusion.
  4. Check the practical number, not just the p-value. A significant 0.3% lift on a low-traffic page is often a rounding error dressed up as a win.
  5. Tie every readout back to the hypothesis. If the honest answer is "no, but we found something else," I say so plainly.
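
Rules 1 and 4 can be made mechanical. The sketch below assumes a simple conversion test read with a two-sided two-proportion z-test, which is one common choice and not necessarily the test behind any given tool's dashboard. `PreRegistration`, `read_out`, the metric name, and the counts are all hypothetical; the point is that the minimum effect is fixed before the data arrives and the readout checks both numbers.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)
class PreRegistration:
    """Written down before launch; a hypothetical structure for illustration."""
    primary_metric: str
    min_relative_lift: float  # 0.05 means the variant must beat control by 5%
    alpha: float = 0.05       # significance threshold, an assumed default


def read_out(plan, control_conv, control_n, variant_conv, variant_n):
    """Two-sided two-proportion z-test plus a practical-significance check."""
    if min(control_n, variant_n) <= 0 or control_conv <= 0:
        raise ValueError("need positive sample sizes and a non-zero control rate")
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    std_err = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if std_err == 0:
        raise ValueError("no variance in the data; the test is undefined")
    z = (p_v - p_c) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_lift = (p_v - p_c) / p_c
    return {
        "metric": plan.primary_metric,
        "relative_lift": relative_lift,
        "p_value": p_value,
        "significant": p_value < plan.alpha,
        "meets_minimum_effect": relative_lift >= plan.min_relative_lift,
    }


# Invented numbers: a statistically significant lift that misses the bar.
plan = PreRegistration(primary_metric="demo_request_rate", min_relative_lift=0.05)
result = read_out(plan, control_conv=10_000, control_n=200_000,
                  variant_conv=10_300, variant_n=200_000)
is_win = result["significant"] and result["meets_minimum_effect"]
print(result)
print("call it a win" if is_win else "do not call it a win")
```

With these made-up counts the variant clears the p-value threshold but falls short of the pre-registered 5% relative lift, so the honest readout is that the test did not confirm the hypothesis.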

Using AI as a skeptic instead of a cheerleader has been the highest-leverage habit I've added. The model attacks a result about as well as it defends one, but most people only ever ask it to defend. That discipline, rather than volume, is how you get a conversion-rate lift that holds up or incremental revenue you can actually attribute: outcomes you can defend in a room full of skeptical executives because you know exactly why they happened.

Running the loop in practice

A week tends to look like this. On Monday I scope two or three hypotheses, each with a clear primary metric, usually off a session-replay pattern in FullStory or a drop in a Looker funnel. AI sharpens the wording, but I own the bet. Tuesday and Wednesday are for generating and hardening variants: AI drafts, I push them apart, and then every variant passes a human gate for brand voice, claim sourcing, and accessibility. Nothing ships that I haven't read line by line, because that review is the cost of moving fast without breaking trust.

Then the test runs to its pre-committed sample size. No peeking, and no stopping early when a number looks good, which is the oldest way to fool yourself. When it ends, I parse the results with AI as my second analyst and designated skeptic, write the readout, and file the learning whether we won or lost. A losing test that teaches you something true is worth more than a winning test you can't explain.
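
A pre-committed sample size has to come from somewhere. One standard approximation for a two-arm conversion test uses the normal approximation for two proportions; the sketch below assumes a two-sided test at alpha 0.05 with 80% power, which are conventional defaults rather than values taken from this framework. The function name and the example baseline are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline_rate: float,
    min_relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Visitors needed in each arm to detect the minimum lift.

    Uses n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2,
    the usual normal-approximation formula for comparing two proportions.
    """
    if not 0 < baseline_rate < 1 or min_relative_lift <= 0:
        raise ValueError("baseline must be in (0, 1) and the lift must be positive")
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    if p2 >= 1:
        raise ValueError("baseline plus lift must stay below 100%")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed example: 5% baseline conversion, 10% minimum relative lift.
needed = sample_size_per_arm(baseline_rate=0.05, min_relative_lift=0.10)
print(f"Run until each arm has at least {needed:,} visitors, then read the result once.")
```

Halving the minimum lift roughly quadruples the required sample, which is why the minimum effect size has to be settled before launch and not after.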

That's the whole framework. AI compresses the lifecycle from days to hours, but it compresses execution, not judgment. Keep the hypothesis sacred, make the model argue against you, and ship the learning rather than just the winner. Do that, and AI becomes a compounding advantage instead of a faster way to be wrong. For how I wire this into a growth org, start with Sebastian Behar.

Written by Sebastian Behar

AI-Driven Growth Leader and Director, Growth Marketing & Analytics at Coursera. I build scalable growth systems across consumer, enterprise, and product-led motions, now powered by agentic AI.