Lesson 3.8 · StrategyGuide · 10 min read · Free · No signup

Observer-Expectancy Effect: researchers influence what they find

Part of the Psychology of Design learning path. The cognitive biases and psychology principles behind every click, scroll, and conversion.

L3 · How people act over time · Lesson 8 of 26

What you'll understand by the end of this lesson

  • Why A/B tests need pre-registered hypotheses before the data is seen
  • How expectation bias changes which results get reported and which get buried
  • Why the person running the test shouldn't be the one deciding when to stop it
  • How to design a testing process that protects results from the researcher's own beliefs

The principle in plain English

When a researcher expects a particular result, they unconsciously influence the experiment in ways that make that result more likely to appear.

This is the Observer-Expectancy Effect, also called experimenter bias. It doesn't require dishonesty. The researcher doesn't need to fabricate data. The influence is subtler: it shows up in how the test is set up, which variation gets the benefit of the doubt, when the test is called complete, and which results get shared with stakeholders.

The experiment reflects the researcher's expectations not through deliberate manipulation but through dozens of small decisions, each of which feels neutral in the moment and each of which adds up to a biased result.


A simple example

A CRO specialist is excited about a new headline they've written. They set up an A/B test and watch the early results. After 4 days, the new headline shows a 12% conversion lift. They're excited — this confirms their thinking. They stop the test and report the win.

What they didn't do: wait for the test to reach an adequate sample size. After 4 days the sample was too small to separate a real 12% lift from random fluctuation, so the result may well be noise rather than signal. But the desire to confirm the hypothesis led to a decision to stop before the data was reliable.
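To see how an early lift like this can fail a basic significance check, here is a minimal sketch of a pooled two-proportion z-test. The visitor and conversion counts are hypothetical (the lesson does not give any); they were chosen only to produce a 12% relative lift on a small sample.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical day-4 numbers: 5.0% vs 5.6% conversion, a 12% relative lift.
p_value = two_proportion_p_value(conv_a=50, n_a=1000, conv_b=56, n_b=1000)
print(f"p-value: {p_value:.2f}")
```

With these assumed counts the p-value comes out far above 0.05, so a difference this size would be unremarkable even if the two headlines performed identically.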

This is one of the most common forms of Observer-Expectancy Effect in CRO: peeking at results and stopping when the number looks good — not when the methodology says it's complete.


Pre-registered hypotheses

The most powerful protection against Observer-Expectancy Effect is the pre-registered hypothesis: writing down exactly what you expect to happen, and why, before the test begins and before any data is seen.

A pre-registered hypothesis looks like:

"We believe changing the CTA from 'Get started' to 'Start your free trial' will increase trial signups by 10–15%, because the specificity of 'free trial' removes ambiguity about what users are committing to. We will run the test for a minimum of 2 weeks or until each variation reaches its pre-calculated sample size, whichever is later, and then evaluate the result once at 95% statistical confidence."

This does three things:

  1. It forces clarity about the mechanism — why should this change work?
  2. It sets stopping rules before the data can influence them.
  3. It creates a public record that makes cherry-picking results harder.

Pre-registering a hypothesis doesn't require a formal academic process. A shared document that the whole team can see, written before the test launches, is enough. The key is that the expected result and the stopping rule are recorded before any live data is observed. Once you've seen the numbers, your brain has already been influenced by them.
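If the team prefers a structured record to a free-form document, one possible shape is sketched below. The class name, field names, test identifier, and sample size are all assumptions made for illustration; only the hypothesis wording and the 2-week minimum come from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PreRegistration:
    test_name: str
    hypothesis: str          # what we expect to happen
    mechanism: str           # why we expect it
    primary_metric: str
    min_days: int
    sample_size_per_variation: int
    registered_at: datetime  # should predate the test launch

record = PreRegistration(
    test_name="cta-free-trial",  # hypothetical identifier
    hypothesis="'Start your free trial' lifts trial signups by 10-15%",
    mechanism="'free trial' removes ambiguity about the commitment",
    primary_metric="trial_signups",
    min_days=14,
    sample_size_per_variation=8000,  # placeholder, not a recommendation
    registered_at=datetime.now(timezone.utc),
)
print(record)
```

Freezing the record is a small design choice that mirrors the point of pre-registration: once written, the expectation and the stopping rule should not be quietly revised.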


Premature stopping and peeking

One of the most common structural mistakes in CRO testing is checking results daily and stopping when the numbers look good. This is called "peeking", and it pushes the false-positive rate well above the level the test's confidence threshold implies.

In a correctly run test, early results are noisy. Conversion rates fluctuate significantly in the first few days as different user segments see the test. An early 15% lift can reverse to a 5% decline by the end of the test period as the sample becomes more representative.

A researcher who peeks and stops when the number is high has not run an A/B test. They've run an exercise in confirmation — and they will report a result that the data doesn't actually support.
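The effect of peeking can be checked with a small simulation. The sketch below runs A/A tests, where both variations share the same true conversion rate, so every declared "winner" is a false positive. The 5% rate, 14 days, 200 visitors per day, and 500 runs are arbitrary assumptions, and the significance check is a simple pooled z-test.

```python
import math
import random

def is_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Pooled two-proportion z-test at roughly the 95% level."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0, 1):
        return False
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return abs(conv_b / n_b - conv_a / n_a) / se > z_crit

def run_aa_test(rng, rate, days, visitors_per_day, peek):
    """Simulate one A/A test; return True if a 'winner' is declared."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
        conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
        n += visitors_per_day
        if peek and is_significant(conv_a, n, conv_b, n):
            return True  # stopped early on a good-looking number
    return is_significant(conv_a, n, conv_b, n)

def false_positive_rate(peek, runs=500, seed=7):
    rng = random.Random(seed)
    hits = sum(run_aa_test(rng, 0.05, 14, 200, peek) for _ in range(runs))
    return hits / runs

fixed = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print("single look at the end:", fixed)
print("peeking every day:     ", peeking)
```

Under these assumptions the single-look rate should sit near the nominal 5%, while the daily-peeking rate typically comes out several times higher, even though nothing about the two variations differs.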

The fix is mechanical: set the sample size and duration before the test begins, and do not look at results until those thresholds are met. This requires discipline, especially when early numbers look compelling.
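Setting the sample size in advance usually means a power calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 5% baseline rate, 10% relative lift, and 80% power are assumed inputs for illustration, not values from this lesson.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_lift,
                              alpha=0.05, power=0.80):
    """Visitors needed per variation to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Assumed inputs: 5% baseline conversion, 10% relative lift to detect.
n = sample_size_per_variation(baseline_rate=0.05, relative_lift=0.10)
print(n)
```

With these inputs the requirement is roughly 31,000 visitors per variation, which gives a sense of why a few days of traffic is rarely enough to confirm a modest lift.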


Who should decide when to stop?

The person who has the most riding on a particular result should not be the person deciding when the test is complete.

If a product manager proposed the change and is emotionally invested in the outcome, they will unconsciously apply different criteria to "is this done?" depending on whether the result confirms or refutes their hypothesis. They will find reasons to extend a test that is showing a negative result, and reasons to call early a test that is showing a positive one.

This is not dishonesty. It is human cognition. The solution is structural: stopping rules should be pre-agreed, and ideally enforced by someone not directly invested in the outcome.
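One way to make a stopping rule enforceable by someone outside the test is to reduce it to a check that never sees the results. This is a minimal sketch; the names `StoppingRule` and `may_stop` and the thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoppingRule:
    min_days: int
    sample_size_per_variation: int

def may_stop(rule, days_run, smallest_variation_size):
    """True only when both pre-agreed thresholds are met.

    The function deliberately takes no conversion data, so the observed
    lift cannot influence the decision.
    """
    return (days_run >= rule.min_days
            and smallest_variation_size >= rule.sample_size_per_variation)

rule = StoppingRule(min_days=14, sample_size_per_variation=8000)  # placeholders
print(may_stop(rule, days_run=4, smallest_variation_size=1000))
print(may_stop(rule, days_run=14, smallest_variation_size=9000))
```

Because the check depends only on elapsed time and traffic, it gives the same answer whether the test is currently showing a win or a loss.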

Selective reporting is a version of Observer-Expectancy Effect that compounds over time. Teams that report wins and quietly bury losses build a false picture of what works. Over months, decisions get made on accumulated false positives. The fix is to document every test result — positive, negative, and inconclusive — in a shared log, with the hypothesis, the sample size, and the statistical confidence level recorded alongside the result.
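A shared log of this kind can be very simple. The sketch below records the fields named above for every test and reports the share of each outcome; the record layout and all entries are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    name: str
    hypothesis: str
    sample_size_per_variation: int
    confidence: float  # e.g. 0.95 means 95% statistical confidence
    outcome: str       # "positive", "negative", or "inconclusive"

def outcome_shares(log):
    """Fraction of logged tests that ended in each outcome."""
    counts = Counter(record.outcome for record in log)
    return {outcome: count / len(log) for outcome, count in counts.items()}

# Hypothetical entries; losses and inconclusive tests are logged too.
log = [
    ExperimentRecord("headline-v2", "Benefit-led headline lifts signups", 8000, 0.97, "positive"),
    ExperimentRecord("cta-colour", "Higher contrast lifts clicks", 8000, 0.62, "inconclusive"),
    ExperimentRecord("short-form", "Fewer fields lift completions", 12000, 0.96, "negative"),
    ExperimentRecord("social-proof", "Logos lift trial starts", 8000, 0.71, "inconclusive"),
]
shares = outcome_shares(log)
print(shares)
```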


The CRO audit

Look at your testing process and ask:

1. Are your hypotheses written before the test launches?

If your team writes hypotheses after seeing early results, or only writes them for tests that succeeded, Observer-Expectancy Effect is already in the system. Introduce a shared pre-test log that must be completed before any test goes live.

2. Are your stopping rules pre-set and followed?

Check your last five tests. How many ran to their pre-agreed sample size or duration? How many were stopped early because the results "looked good"? If the answer is that most were stopped based on gut feel rather than pre-agreed criteria, your test results are unreliable.
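The check on the last five tests can be done by hand, or with a few lines like the sketch below. The test names and sample sizes are made up; the only logic is comparing the planned sample size against the sample size at the moment each test was stopped.

```python
# (test name, planned sample size per variation, actual size when stopped)
# All values are hypothetical.
last_five = [
    ("headline-v2", 8000, 2100),
    ("cta-colour", 8000, 8150),
    ("short-form", 12000, 12040),
    ("social-proof", 8000, 3900),
    ("pricing-toggle", 10000, 4200),
]

stopped_early = [name for name, planned, actual in last_five if actual < planned]
share_early = len(stopped_early) / len(last_five)
print(stopped_early, share_early)
```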

3. Do you have a log of all tests — not just the ones that won?

A test log that only contains winning results is a selection bias problem. Every test, regardless of outcome, should be recorded. A team that can't answer "what percentage of our tests produce a positive result?" doesn't have a testing programme — it has a cherry-picking programme.



Q1

A CRO analyst checks an A/B test after 3 days. The new variation shows a 14% conversion lift. They stop the test and report it as a win. What is the most significant problem with this decision?

Think about this

You've seen how expectations bias research. Now — what about changes themselves? When is a change too small to notice, and how can you use that to introduce new things without users pushing back?