
Uncovering the Costly Bias in Marketplace Testing

Statistical bias could be misleading your product and feature testing, according to research from Columbia Business School Professor Hannah Li, but solutions might be easier than you think.

Published
April 21, 2025
Publication
Research In Brief
Focus On
Digital Future, Marketplace Design
Article Author(s)
Jonathan Sperling

Writer/Editor
Marketing and Communications
Category
Thought Leadership
Topic(s)
Data/Big Data, AI and Transformative Tech, Marketplace

About the Researcher(s)

Hannah Li

Assistant Professor of Business
Decision, Risk, and Operations Division

A/B testing can be a relatively quick, cost-efficient way for leaders and their companies to test new features on a subset of users and understand the impact before broader deployment. In many industries, however, this testing comes with a serious caveat.

Imagine you're testing a new feature on your website — say, the impact of showing better-quality photos for rental listings on a platform like Airbnb. You randomly split users into two groups in preparation for an A/B test: a treatment group sees new, high-quality photos, while a control group sees the original, standard images.
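
The setup described above amounts to a per-user coin flip. A minimal sketch of that kind of split, with illustrative names (not any platform's actual implementation):

```python
import random

def assign_groups(user_ids, p_treatment=0.5, seed=42):
    """Independently assign each user to the treatment or control group."""
    rng = random.Random(seed)
    return {
        uid: "treatment" if rng.random() < p_treatment else "control"
        for uid in user_ids
    }

groups = assign_groups(range(10_000))
n_treated = sum(1 for arm in groups.values() if arm == "treatment")
```

The crucial hidden assumption is that these per-user assignments are all the experiment needs, which holds only if users do not affect one another.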

In a perfect world, each user's behavior would be unaffected by what the other group sees. But that assumption often breaks down in reality, especially in marketplaces or social networks. According to research from Hannah Li, an assistant professor in Columbia Business School's Decision, Risk, and Operations Division, users don't operate in isolation; they interact, compete, and influence each other.

"When you run A/B testing in marketplaces where you have users buying and selling things from each other, the users are no longer going to be independent," Li says.

Key Takeaways:

- Traditional A/B testing assumes user independence: the treatment assigned to one individual does not influence the behavior of another.

- In platforms like marketplaces or social networks, this assumption often fails because users interact, compete, or influence one another, creating interference bias.

- As a result, companies risk wrongly rolling out or rejecting features, all while believing they're making sound, data-driven decisions.

- Smarter experimental designs, such as Two-Sided Randomization, can reduce this bias.

- Other biases can arise in recommendation systems, where users strategically interact with their recommendation algorithms by deliberately changing how they engage with content.

Preventing Statistical Bias

Li explained that when someone in the treatment group books a listing due to the higher-quality photos, there's now one less listing available for someone in the control group. This means the treatment unintentionally affects the control group, violating a core assumption of A/B testing: independence. 

That distortion is what Li and her fellow researchers call interference bias, a phenomenon that can inflate estimates by as much as 230 percent, meaning companies might believe an intervention is more than twice as effective as it actually is. That can lead to false confidence in a product change: launching something you think is a success, only to find it doesn't work in the real world. Worse, it might cause you to kill ideas that would have worked, simply because your experiment didn't account for how users affect one another. All the while, the company believes it is making airtight, data-driven decisions.
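
The dynamic Li describes can be reproduced in a toy model. The sketch below is not the paper's continuous-time Markov chain; it is a simplified deterministic approximation with illustrative parameters, in which treated and control buyers draw from one shared pool of listings, so the naive A/B comparison overstates the true global effect:

```python
def booked_fraction(p, n_buyers, n_listings):
    """Per-buyer conversion rate when *all* buyers convert with
    probability p, scaled by the fraction of listings still available
    (a deterministic 'fluid' approximation of finite inventory)."""
    booked = 0.0
    for _ in range(n_buyers):
        avail = max(n_listings - booked, 0.0) / n_listings
        booked += p * avail
    return booked / n_buyers

def mixed_experiment(p_treat, p_ctrl, n_buyers, n_listings):
    """Naive user-level A/B test: treatment and control buyers
    alternate over one *shared* inventory, so every treated booking
    removes a listing the control group could have booked."""
    booked_t = booked_c = 0.0
    for i in range(n_buyers):
        avail = max(n_listings - booked_t - booked_c, 0.0) / n_listings
        if i % 2 == 0:
            booked_t += p_treat * avail
        else:
            booked_c += p_ctrl * avail
    half = n_buyers / 2
    return booked_t / half, booked_c / half

p_t, p_c, N, M = 0.2, 0.1, 2_000, 100  # scarce inventory: 2,000 buyers, 100 listings
rate_t, rate_c = mixed_experiment(p_t, p_c, N, M)
naive_lift = rate_t - rate_c  # what the naive A/B test reports
# True global effect: everyone treated vs. everyone in control
gte = booked_fraction(p_t, N, M) - booked_fraction(p_c, N, M)
```

Under these assumed parameters the naive estimate comes out several times larger than the global effect, the same direction of inflation the research warns about.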

In their research, Li and her co-researchers found that implementing the right experimental systems can curtail this bias.

Interference in Action

To investigate how interference bias arises in two-sided platforms, the researchers developed a formal marketplace model using continuous-time Markov chains. This mathematical framework allowed them to simulate a dynamic environment where buyers and sellers arrive, interact, and transact over time. 

Li and her co-researchers found that this bias can be prevented through a novel form of experimental design known as Two-Sided Randomization (TSR). Rather than randomizing only sellers or only buyers into treatment and control groups, TSR randomizes both sides of the marketplace simultaneously. This design allows the platform to measure the competition effects between sellers and between buyers, the source of the interference bias, and account for these effects in the experiment's estimates.
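
A minimal sketch of a TSR assignment, assuming the common convention that an interaction receives the new feature only when both sides are treated (function and variable names are illustrative):

```python
import random

def tsr_assign(buyer_ids, listing_ids, p_buyer=0.5, p_listing=0.5, seed=7):
    """Two-sided randomization: buyers and listings are randomized
    independently, so the experiment observes all four
    buyer-arm/listing-arm combinations."""
    rng = random.Random(seed)
    buyer_arm = {b: rng.random() < p_buyer for b in buyer_ids}
    listing_arm = {s: rng.random() < p_listing for s in listing_ids}
    return buyer_arm, listing_arm

def shows_new_feature(buyer, listing, buyer_arm, listing_arm):
    """Assumed convention: the feature (e.g., upgraded photos) appears
    only when both the buyer and the listing are treated."""
    return buyer_arm[buyer] and listing_arm[listing]
```

Because partially treated interactions (one side treated, one not) are observed too, the platform can estimate how much treated demand crowds out control demand rather than conflating it with the feature's effect.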

This leads to far more accurate estimates of an experiment's Global Treatment Effect (GTE) — the metric most companies care about when deciding whether to roll out a feature to all users. Simulations from Li and her co-researchers' paper show that TSR consistently produces lower bias than standard experimental methods, across a wide range of market conditions.

If TSR is not feasible, there are other approaches companies can take, according to Li. Cluster Randomization, for example, groups users into clusters (e.g., by region) and randomizes entire clusters to treatment or control, minimizing cross-group interaction.
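
A sketch of region-level assignment (names are illustrative):

```python
import random

def cluster_randomize(user_to_region, seed=3):
    """Randomize at the region (cluster) level: users likely to compete
    for the same inventory all share one experimental arm."""
    rng = random.Random(seed)
    arms = {
        region: rng.choice(["treatment", "control"])
        for region in sorted(set(user_to_region.values()))
    }
    return {user: arms[region] for user, region in user_to_region.items()}
```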

Another technique is Switchback Testing. Instead of splitting users into control and treatment groups, the platform alternates the treatment across time periods for all users (e.g., on one day, off the next).
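
A sketch of such a schedule:

```python
def switchback_schedule(n_periods):
    """Whole-platform switchback: the feature is on in even periods
    (e.g., days) and off in odd ones, so treated and control
    observations never share the same moment of inventory."""
    return ["treatment" if t % 2 == 0 else "control" for t in range(n_periods)]
```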

When Users are Strategic

A subsequent paper by Li studies how users strategically interact with online platforms to influence the content recommended to them, another form of bias that can throw companies off.

Typically, platforms like TikTok, Netflix, and Amazon suggest content based on users' past behaviors, assuming user interactions are straightforward reflections of their preferences. However, Li and her co-researchers' study suggests that users often engage in strategic behavior to shape their future recommendations.

For instance, when participants were informed that an algorithm prioritizes "likes" and "dislikes," they used these features almost twice as much as those told the algorithm focuses on viewing time. Through surveys, the researchers found that nearly half of the participants admitted to altering their behavior on platforms to control future recommendations. Some users even reported avoiding content they enjoy to prevent the platform from over-recommending similar content in the future.

"If you watch a video on YouTube, the platform learns that you like it. If you don't watch it, they learn you don't like it. But what we heard is that users are strategizing. They may see a YouTube video and actually like it, but they know that if they click on it, they will get millions of the same videos for the next three weeks. So, they don't watch the video," Li says, adding that "when this happens, the data that's being collected is not representative of the user's true preferences."

Experimental Music

To study how users adapt their behavior in response to recommendation systems, Li and her co-authors created their own music streaming app—essentially a simplified version of Spotify. This gave them total control over what users saw and how the system reacted. By stripping away real-world platform complexities, they could focus entirely on whether users tried to “game” the algorithm.

The study’s 750 participants were randomly assigned to different conditions in a controlled environment. Everyone listened to songs and could “like” or “dislike” them, or just skip ahead. In the first session, participants used the music player naturally, as if they were on a real platform. 

In the following session, participants were randomly told different things about how the recommendation algorithm worked. Some were told the system cared most about likes/dislikes, others were told it prioritized listening time, and a control group got no guidance.

This setup let the researchers test how user behavior changed depending on what users believed the algorithm cared about, without changing the actual algorithm. By observing how people's actions varied under these scenarios, the researchers could see whether users acted strategically, choosing actions not just based on personal enjoyment but also on what they thought would "train" the algorithm in their favor.

The researchers paid close attention to the number of “likes” and “dislikes” and how long users stayed on each song, or dwell time. The researchers also conducted follow-up surveys to confirm whether users admitted to similar strategic behaviors on real-world platforms like Spotify or TikTok.
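
A sketch of how event logs like these might be aggregated per experimental condition; the field names and schema are hypothetical, not the study's actual data format:

```python
from collections import defaultdict

def summarize_by_condition(events):
    """Aggregate per-condition behavioral metrics from play events.
    Each event is assumed to look like:
      {"condition": str, "action": "like"|"dislike"|"skip"|"none",
       "dwell_sec": float}
    Returns the explicit-feedback rate and mean dwell time per condition."""
    acc = defaultdict(lambda: {"plays": 0, "feedback": 0, "dwell": 0.0})
    for event in events:
        stats = acc[event["condition"]]
        stats["plays"] += 1
        stats["dwell"] += event["dwell_sec"]
        if event["action"] in ("like", "dislike"):
            stats["feedback"] += 1
    return {
        cond: {
            "feedback_rate": s["feedback"] / s["plays"],
            "mean_dwell_sec": s["dwell"] / s["plays"],
        }
        for cond, s in acc.items()
    }
```

Comparing these summaries across conditions is what would reveal, for example, that users told the algorithm watches likes/dislikes use those buttons far more often.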

Li suggested that the fact that users are strategizing indicates that recommendation systems, such as Instagram's "Explore" page, may be over-indexing on known user preferences rather than exploring new content. Adjusting the algorithms to be less heavy-handed in pushing familiar content could help address this issue.

She also noted that users would ideally be able to more easily alter the algorithm behind their personal feed rather than strategize their behavior. Giving users more control and transparency over the recommendation system could help mitigate strategization.

 

Adapted from “Measuring Strategization in Recommendation,” by Hannah Li of Columbia Business School, Sarah H. Cen of Massachusetts Institute of Technology, Andrew Ilyas of Massachusetts Institute of Technology, Jennifer Allen of Massachusetts Institute of Technology, and Aleksander Mądry of Massachusetts Institute of Technology.

Also adapted from “Experimental Design in Two-Sided Platforms,” by Hannah Li of Columbia Business School, Ramesh Johari of Stanford University, Inessa Liskovich of Stanford University, and Gabriel Y. Weintraub of Stanford University.
