Interference in A/B Tests
While interference is a major problem affecting A/B tests, we can mitigate its effects by redesigning the experimentation platform and applying the right statistical techniques.
A/B testing is the gold standard for online experimentation and is used by most companies to test their product features. While A/B testing works just fine in most settings, it is especially susceptible to interference bias in online marketplaces and social networks. In this article, we look at situations where interference bias arises and at some potential ways to mitigate its effect on evaluation.
SUTVA, the Fundamental Assumption of A/B Testing and Its Violations
One of the fundamental assumptions of A/B testing is SUTVA, the Stable Unit Treatment Value Assumption: the potential outcome of a randomization unit depends only on the treatment it receives and not on the treatments assigned to other units.
This assumption is often violated in experiments on marketplaces and social networks. Some examples of potential violations include:
- A/B test experiments on social networks. For example, say we want to understand the effect of adding a 'Stories' feature on Instagram. A feature that increases engagement for people in the treatment arm can also affect the people connected to them in the control arm: control users respond to those stories, which increases their engagement too. The measured difference between the arms therefore understates the real treatment effect.
- A/B test experiments on rideshare marketplaces. Say a rideshare marketplace introduces a rider discount and wants to test it against a control of no discount, with the number of rides as the metric of interest. If treatment riders start requesting more rides, fewer drivers are available to control riders, so control rides drop. The treatment effect in this case is exaggerated.
- A similar example is an ads marketplace where multiple campaigns compete for an ad slot but the advertiser budget is fixed and shared across treatment and control. Imagine our proposed feature increases the click-through rate: as treatment spends the budget faster, less budget is left for the control group, and the treatment effect is again inflated.
Mitigating Interference Effects
We can mitigate the impact of interference in A/B tests through a combination of changes to the experiment setup and causal inference techniques. I focus on the intuition behind each technique rather than the technical details, and I share references so you can dig deeper later.
Budget Split Testing
This is generally used when a common resource, such as an advertiser budget, is shared between treatment and control. The budget is split in the ratio of the experiment traffic so that treatment and control each have their own budget and there is no cannibalization. The method can be costly because it can lead to underutilization of the budget. More details can be found in the paper by Min Liu et al., 2021. Below is skeleton code for a budget-split experimentation system.
import random

class BudgetSplitTest:
    def __init__(self, total_budget, control_traffic_ratio):
        self.total_budget = total_budget
        self.control_traffic_ratio = control_traffic_ratio
        self.treatment_traffic_ratio = 1 - control_traffic_ratio

        # Split budget based on traffic ratio
        self.control_budget = total_budget * control_traffic_ratio
        self.treatment_budget = total_budget * self.treatment_traffic_ratio

        # Track spent budget and conversions
        self.control_spent = 0
        self.treatment_spent = 0
        self.control_conversions = 0
        self.treatment_conversions = 0

    def run_experiment(self, total_impressions):
        for _ in range(total_impressions):
            if random.random() < self.control_traffic_ratio:
                self._serve_control_ad()
            else:
                self._serve_treatment_ad()

    def _serve_control_ad(self):
        if self.control_spent < self.control_budget:
            spend = min(random.uniform(0.1, 1.0), self.control_budget - self.control_spent)
            self.control_spent += spend
            if random.random() < 0.1:  # 10% conversion rate for control
                self.control_conversions += 1

    def _serve_treatment_ad(self):
        if self.treatment_spent < self.treatment_budget:
            spend = min(random.uniform(0.1, 1.0), self.treatment_budget - self.treatment_spent)
            self.treatment_spent += spend
            if random.random() < 0.15:  # 15% conversion rate for treatment
                self.treatment_conversions += 1

    def get_results(self):
        return {
            "Control": {
                "Budget": round(self.control_budget, 2),
                "Spent": round(self.control_spent, 2),
                "Conversions": self.control_conversions,
                "CPA": round(self.control_spent / self.control_conversions, 2) if self.control_conversions else 0
            },
            "Treatment": {
                "Budget": round(self.treatment_budget, 2),
                "Spent": round(self.treatment_spent, 2),
                "Conversions": self.treatment_conversions,
                "CPA": round(self.treatment_spent / self.treatment_conversions, 2) if self.treatment_conversions else 0
            }
        }

# Run the experiment
total_budget = 10000
control_traffic_ratio = 0.5  # 50% traffic to control, 50% to treatment
total_impressions = 100000

experiment = BudgetSplitTest(total_budget, control_traffic_ratio)
experiment.run_experiment(total_impressions)
results = experiment.get_results()
Switchback Experiments
Switchbacks are more common in two-sided marketplaces like Lyft, Uber, and DoorDash: instead of randomizing users, the whole marketplace switches between treatment and control, so the randomization unit is a time interval rather than a user. If the intervals are too short, spillover from treatment into control can persist across switches; if they are too long, the experiment can be underpowered. Power can be improved with methods like regression analysis.
import random
from datetime import datetime, timedelta

class SwitchbackExperiment:
    def __init__(self, experiment_name, start_time, end_time, interval_hours=1):
        self.name = experiment_name
        self.start_time = start_time
        self.end_time = end_time
        self.interval_hours = interval_hours
        self.schedule = self._create_schedule()
        self.data = []

    def _create_schedule(self):
        schedule = []
        current_time = self.start_time
        while current_time < self.end_time:
            schedule.append({
                'start': current_time,
                'end': current_time + timedelta(hours=self.interval_hours),
                'variant': random.choice(['control', 'treatment'])
            })
            current_time += timedelta(hours=self.interval_hours)
        return schedule

    def get_active_variant(self, timestamp):
        for interval in self.schedule:
            if interval['start'] <= timestamp < interval['end']:
                return interval['variant']
        return None  # Outside experiment time range

    def record_event(self, timestamp, metric_value):
        variant = self.get_active_variant(timestamp)
        if variant:
            self.data.append({
                'timestamp': timestamp,
                'variant': variant,
                'metric_value': metric_value
            })

    def get_results(self):
        control_data = [event['metric_value'] for event in self.data if event['variant'] == 'control']
        treatment_data = [event['metric_value'] for event in self.data if event['variant'] == 'treatment']
        return {
            'control': {
                'count': len(control_data),
                'total': sum(control_data),
                'average': sum(control_data) / len(control_data) if control_data else 0
            },
            'treatment': {
                'count': len(treatment_data),
                'total': sum(treatment_data),
                'average': sum(treatment_data) / len(treatment_data) if treatment_data else 0
            }
        }

# Example usage
if __name__ == "__main__":
    # Set up the experiment
    start = datetime(2023, 5, 1, 0, 0)
    end = datetime(2023, 5, 8, 0, 0)  # One-week experiment
    exp = SwitchbackExperiment("New Pricing Algorithm", start, end, interval_hours=4)

    # Simulate events (e.g., rides in a rideshare app)
    current_time = start
    while current_time < end:
        # Simulate more rides during peak hours
        num_rides = random.randint(5, 20)
        if 7 <= current_time.hour <= 9 or 16 <= current_time.hour <= 18:
            num_rides *= 2
        for _ in range(num_rides):
            # Simulate a ride
            ride_time = current_time + timedelta(minutes=random.randint(0, 59))
            ride_value = random.uniform(10, 50)  # Ride value between $10 and $50
            exp.record_event(ride_time, ride_value)
        current_time += timedelta(hours=1)

    # Analyze results
    results = exp.get_results()
Graph Cluster Randomization (GCR)
In social network experiments, graph cluster randomization is a technique used to reduce interference bias. The network is partitioned into clusters that respect its community structure, and whole clusters, rather than individual users, are randomly assigned to treatment or control. Because most of a user's connections fall inside their own cluster, spillover across variants is reduced. A minimal sketch of the assignment step is shown below.
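The sketch assumes the clusters have already been computed by some community-detection step; the user_clusters mapping is a hypothetical input used only for illustration.

import random

def assign_by_cluster(user_clusters, treatment_share=0.5, seed=42):
    """Assign every user in a cluster to the same variant.

    user_clusters: dict mapping user_id -> cluster_id, assumed to come from
    a prior community-detection step (hypothetical input).
    """
    rng = random.Random(seed)
    cluster_ids = sorted(set(user_clusters.values()))
    # Randomize at the cluster level, not the user level
    cluster_variant = {
        c: ('treatment' if rng.random() < treatment_share else 'control')
        for c in cluster_ids
    }
    # Every user inherits the variant of their cluster, so most edges
    # stay within a single variant and spillover is reduced
    return {user: cluster_variant[cluster] for user, cluster in user_clusters.items()}

# Example usage with a toy clustering
user_clusters = {'u1': 0, 'u2': 0, 'u3': 1, 'u4': 1, 'u5': 2}
print(assign_by_cluster(user_clusters))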
Resource-Adjusted Metrics
Rather than solely focusing on absolute outcomes, we can use metrics that account for resource allocation. For instance, in an ad campaign, instead of just measuring clicks, we might track cost per click or return on ad spend, which normalizes the results across varying budget levels.
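As a small illustration, here is a sketch that reads out a hypothetical experiment on cost per click and return on ad spend instead of raw clicks; all numbers below are made up.

def resource_adjusted_metrics(spend, clicks, revenue):
    """Normalize outcomes by the resources consumed (illustrative only)."""
    cpc = spend / clicks if clicks else float('inf')  # cost per click
    roas = revenue / spend if spend else 0.0          # return on ad spend
    return {'CPC': round(cpc, 2), 'ROAS': round(roas, 2)}

# Hypothetical experiment readout
control = resource_adjusted_metrics(spend=5000, clicks=2500, revenue=9000)
treatment = resource_adjusted_metrics(spend=5000, clicks=3100, revenue=10200)
print('Control:', control)
print('Treatment:', treatment)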
Synthetic Control
In cases of interference, a synthetic control can be constructed to model what a unit's metric would have been without the treatment, based on the metrics of other units. For example, take the country as the unit: in a pre-test period, a country's metric is modeled as a function of other countries' metrics. After we promote the feature in that country, we estimate the effect of the promotion by comparing the observed metric with the metric predicted by the model. The variance of this approach can be high, so it may not be sensitive enough to measure small effects.
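A minimal sketch of the idea follows, assuming we have a metric series for the treated country and a few donor countries. It fits unconstrained least-squares weights on the pre-period, whereas real synthetic-control implementations typically constrain the weights to be non-negative and sum to one; all numbers are fabricated.

import numpy as np

# Hypothetical weekly metric values (pre- and post-intervention)
donor_pre = np.array([[100, 102,  98, 101, 103],   # donor country A
                      [ 90,  91,  92,  90,  93],   # donor country B
                      [110, 108, 111, 112, 109]])  # donor country C
treated_pre = np.array([104, 105, 103, 106, 107])

donor_post = np.array([[104, 106, 105],
                       [ 94,  95,  96],
                       [111, 113, 112]])
treated_post = np.array([118, 121, 120])  # observed after the promotion

# Fit weights on the pre-period: treated ~ weights . donors
weights, *_ = np.linalg.lstsq(donor_pre.T, treated_pre, rcond=None)

# Predict the counterfactual (no promotion) for the post-period
counterfactual = donor_post.T @ weights
effect = treated_post - counterfactual
print("Estimated lift per week:", np.round(effect, 2))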
ITSA
ITSA stands for interrupted time series analysis. Define an intervention point, such as a feature promotion, then use the pre-intervention time series to predict what the metric would have looked like without the intervention. Comparing these predictions with the actual observations tells you whether the intervention had an effect on the time series.
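Here is a quick sketch using a simple linear pre-intervention trend fitted with NumPy; a real analysis would usually use segmented regression and handle seasonality and autocorrelation. The data is simulated purely for illustration.

import numpy as np

# Hypothetical daily metric; the feature was promoted on day 30
metric = np.concatenate([
    50 + 0.2 * np.arange(30) + np.random.normal(0, 1, 30),  # pre-intervention
    60 + 0.2 * np.arange(30) + np.random.normal(0, 1, 30),  # post-intervention
])
intervention_day = 30
days = np.arange(len(metric))

# Fit a linear trend on the pre-intervention period only
slope, intercept = np.polyfit(days[:intervention_day], metric[:intervention_day], 1)

# Project that trend forward and compare with what actually happened
predicted_post = intercept + slope * days[intervention_day:]
actual_post = metric[intervention_day:]
print("Average post-intervention lift:", round(float(np.mean(actual_post - predicted_post)), 2))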
Staggered Rollouts
Gradually introduce changes to a small subset of users and monitor the results before expanding the rollout. This allows you to detect potential issues early on and mitigate the impact of interference.
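A small sketch of one way to implement a ramp is shown below: users are bucketed deterministically with a hash, exposure grows through a predefined schedule, and the rollout advances only while a guardrail check (assumed here, not implemented) passes.

import hashlib

RAMP_SCHEDULE = [0.01, 0.05, 0.25, 0.50, 1.00]  # fraction of users exposed at each stage

def in_rollout(user_id: str, exposure: float) -> bool:
    """Deterministically bucket a user into [0, 1) and compare with the current exposure."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    return bucket < exposure

def advance_rollout(stage: int, guardrails_healthy: bool) -> int:
    """Move to the next stage only if guardrail metrics look healthy (check is assumed)."""
    if guardrails_healthy and stage < len(RAMP_SCHEDULE) - 1:
        return stage + 1
    return stage

# Example: check whether a user sees the feature at the 5% stage
stage = 1
print(in_rollout("user_123", RAMP_SCHEDULE[stage]))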
In reality, all these methods should be used in conjunction with standard A/B testing. For example, we could define metrics that detect whether interference is present in the ads marketplace; if it is not a problem, the A/B test results can be trusted, and otherwise we can fall back to a budget-split test.