A/B Testing and Product Learning Notes

Task Lists

Introduction and Motivation
Running and Analyzing Experiments: An End-to-End Example
Experiment trustworthiness
Organizational Metrics

Based on the book Trustworthy Online Controlled Experiments.

Introduction and Motivation

Online controlled experiment (A/B testing)

The most common online experiment setting

Users are randomly split between variants in a persistent manner, so a returning user always sees the same variant. The simplest version has two groups: Control (A) and Treatment (B).
Overall Evaluation Criterion (OEC): a quantitative measure of the experiment’s objective, e.g. DAU, LTV, sessions per user (usage), relevance (time to success). Requirements for an OEC:
1. Measurable in the short term
2. Believed to causally drive long-term strategic objectives
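
One common way to get a random but persistent split is to hash the user ID together with an experiment name. The sketch below is only an illustration under that assumption: the function name `assign_variant`, the experiment key, and the choice of SHA-256 are hypothetical and not taken from the book.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment.

    Hashing (experiment, user_id) gives the same answer on every visit,
    and a different split for each experiment name.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)  # roughly uniform bucket index
    return variants[bucket]

# The same user always lands in the same group.
print(assign_variant("user-42", "coupon-field"))
print(assign_variant("user-42", "coupon-field"))
```

Because assignment depends only on the ID and the experiment name, no lookup table is needed to keep the split persistent.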

Metric teams: choosing metrics, validating metrics, and evolving metrics over time

Motivation

Case description:

Reduce the user churn rate in a subscription business, e.g. Amazon Prime or a Spotify membership. To achieve this, we may introduce a new feature and need to show a causal relationship between the new feature and the churn rate. Causal inference knowledge may be required in this process. Keep in mind that correlation does not imply causality.

The hierarchy of evidence is used to assess the quality of a trial design.

Motivation:

Establish causality with high probability
Detect minor and unexpected changes

Experiment Trustworthiness

Extreme results

Potential explanations
- Instrumentation (logging) errors
- Computational errors
- Loss of data

Misinterpretation of the result

Common errors
- Lack of statistical power
- Misinterpreting p-values: the p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.
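
To make the p-value definition concrete, here is a minimal sketch that turns a z-statistic into a two-sided p-value under a normal approximation. The normal approximation and the function names are assumptions for illustration.

```python
import math

def z_statistic(delta: float, standard_error: float) -> float:
    """Observed difference divided by its standard error."""
    return delta / standard_error

def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z.

    This is the probability of a result at least this extreme
    when the null hypothesis (no treatment effect) is true.
    """
    return math.erfc(abs(z) / math.sqrt(2.0))

print(two_sided_p_value(z_statistic(delta=0.98, standard_error=0.5)))
```

A z-statistic of about 1.96 corresponds to the familiar 0.05 threshold.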

Multiple Comparisons

Common scenarios
- Multiple metrics
- Looking at p-values over time
- Population segmentation
- Multiple iterations
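
A rough sketch of why these scenarios matter: if each test has false-positive rate alpha and the tests are independent (an assumption), the chance of at least one false positive grows quickly with the number of tests. The Bonferroni correction shown here is one common remedy, named as an illustration rather than as the method these notes prescribe.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold that keeps the overall rate near alpha."""
    return alpha / num_tests

# 20 metrics checked at alpha = 0.05
print(familywise_error_rate(0.05, 20))
print(bonferroni_threshold(0.05, 20))
```

With 20 independent checks at 0.05, the chance of at least one spurious "significant" result is roughly 64%.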

Confidence Interval

Quantifies the degree of uncertainty in the treatment effect.

Interpretation of a CI: the confidence level describes how often a CI computed this way, across many repeated studies, would contain the true treatment effect (e.g. 95% of the time for a 95% CI).
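
The "how often" interpretation can be checked by simulation. The sketch below computes a normal-approximation 95% CI for a difference in means and counts how often it covers a known true effect. The simulated distributions, sample sizes, and function names are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, variance

def diff_in_means_ci(control, treatment, z=1.96):
    """Normal-approximation CI for mean(treatment) - mean(control)."""
    delta = mean(treatment) - mean(control)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    return delta - z * se, delta + z * se

def coverage(true_effect=0.5, n=200, trials=1000, seed=0):
    """Fraction of simulated experiments whose CI contains the true effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(10.0, 2.0) for _ in range(n)]
        treatment = [rng.gauss(10.0 + true_effect, 2.0) for _ in range(n)]
        low, high = diff_in_means_ci(control, treatment)
        hits += low <= true_effect <= high
    return hits / trials

print(coverage())  # expected to be close to 0.95
```

Any single interval either contains the true effect or does not; the 95% refers to the long-run behavior of the procedure.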

Internal Validity

SUTVA (Stable Unit Treatment Value Assumption): no interference between units
- Examples of violations: social networks; communication tools; collaboration tools; shared documentation; two-sided markets (eBay/Uber/Lyft)
Intention to Treat
Survivorship Bias
Sample Ratio Mismatch (SRM). Possible causes:
- Browser redirect
- Lossy instrumentation
- Residual or carryover effect
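
An SRM check can be sketched as a chi-squared goodness-of-fit test of the observed group sizes against the designed split. The counts below are illustrative, and the 0.001 alert threshold is an assumed convention, not something stated in these notes.

```python
import math

def srm_p_value(control_count: int, treatment_count: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-squared test (1 degree of freedom) of observed vs. designed split."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # Survival function of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

p = srm_p_value(821_588, 815_482)  # illustrative counts for a 50/50 design
print(p, "possible SRM" if p < 0.001 else "split looks consistent")
```

A very small p-value means the observed ratio is unlikely under the designed split, so the experiment results should not be trusted until the cause is found.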

External Validity (Generalization)

Primacy Effect and Novelty Effect

These effects can be highlighted by analyzing new users only, since new users have no prior experience with the old version.

Segmentation

Market/Country
Device/Platform
Time of Day/Day of Week
User Type(new/existing/returning)
Single/Shared/Family

Segment-level effects correspond to the Conditional Average Treatment Effect (CATE) in causal inference.

Simpson's Paradox

$\frac{b}{a}<\frac{B}{A},\frac{d}{c}<\frac{D}{C},\frac{b+d}{a+c}>\frac{B+D}{A+C}$

Possible reasons:
- Different Treatment/Control allocation ratios in different segments.
- Non-uniform sampling.
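
A small numeric check of the inequality above, reading lowercase letters as Control (users a, c and conversions b, d) and uppercase letters as Treatment. That reading and the counts are assumptions made for illustration, not real experiment data.

```python
# Illustrative (conversions, users) per segment; the allocation ratio
# differs between segments, which is what triggers the paradox.
segments = {
    "segment 1": {"control": (20_000, 990_000), "treatment": (230, 10_000)},
    "segment 2": {"control": (5_000, 500_000), "treatment": (6_000, 500_000)},
}

def segment_rate(segment: str, variant: str) -> float:
    conversions, users = segments[segment][variant]
    return conversions / users

def pooled_rate(variant: str) -> float:
    conversions = sum(seg[variant][0] for seg in segments.values())
    users = sum(seg[variant][1] for seg in segments.values())
    return conversions / users

for name in segments:
    print(name, segment_rate(name, "control"), segment_rate(name, "treatment"))
print("pooled", pooled_rate("control"), pooled_rate("treatment"))
```

Treatment wins inside each segment (about 2.30% vs 2.02%, and 1.20% vs 1.00%) yet loses in the pooled data (about 1.22% vs 1.68%), because segment 1 is almost entirely Control.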

Organizational Metrics

Goal Metrics (Success Metrics)

Goal metrics may not be easy to move in the short term, since each initiative may have only a very small impact on the metric, or the impact may take a long time to materialize.

Goal metrics should have the following characteristics:

Simple
Stable

Driver Metrics (Signpost Metrics)

Shorter-term, fast-moving, more sensitive

HEART: Happiness, Engagement, Adoption, Retention, and Task Success
AARRR!: Acquisition, Activation, Retention, Referral, Revenue

Driver metrics should have the characteristics:

Aligned with the goal: drivers of success
Actionable and relevant
Sensitive
Resistant to gaming

Guardrail Metrics

Guard against violated assumptions, e.g. latency regressions. Other examples:

HTML response size per page
JS error per page
Pageviews per user
Client crash

Other taxonomies

Asset and Engagement
- Asset metrics: accumulation of static assets, e.g. total number of Facebook users or total number of connections
- Engagement metrics: the value a user receives as a result of an action, or from others using the product, e.g. sessions or pageviews
Business and Operational
- Business metrics: track the health of the business, e.g. revenue per user or DAU
- Operational metrics: track operational concerns, e.g. queries per second

We need to measure goal, driver, and guardrail metrics at both the company level and the team level.

Aligning goal and driver metrics to the overall business strategy

An increase in latency of even a few milliseconds can result in revenue loss and a reduction in user satisfaction.

Unconstrained metrics are gameable

Evaluating Metrics

Establishing the causal relationship of driver metrics to organizational goals

Running and Analyzing the experiment

Case study

Objective: The marketing department wants to increase sales by sending coupon codes (promotion codes for a discount), which requires adding a coupon field to the checkout page. We need to evaluate the effect of this UI change and assess the feasibility of the new business model.
Concerns: The newly added field could distract people from checking out, leading to a decrease in revenue.
Approach: Fake door (painted door) approach.
Goal metric: Revenue per user; the randomization unit is the user.
Users included: The users who start the purchasing process
- Not all users: noisy, since most users never start the purchase process
- Not only the users who complete a purchase: that would miss any effect of the change on whether users complete the purchase at all
Hypothesis Testing
1. Baseline mean (for most metrics) and its standard error
2. Experiment size: number of randomization units and how long to run
3. Statistical significance and practical significance
How long to run the experiment
1. More users: running longer includes more users, which increases statistical power.
2. Day-of-week effect: run longer than one week to capture the weekly cycle.
3. Seasonality: e.g. holidays
4. Primacy and novelty effects: experiments take time to stabilize
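
The sizing and duration considerations above can be sketched with a common rule of thumb: about 16σ²/δ² units per variant for roughly 80% power at a 0.05 significance level, where σ is the metric's standard deviation and δ the smallest effect worth detecting. Rounding the duration up to whole weeks is an assumption made here to cover the weekly cycle; the function names and inputs are hypothetical.

```python
import math

def users_per_variant(std_dev: float, min_detectable_effect: float) -> int:
    """Rule-of-thumb sample size per variant (~80% power, alpha = 0.05)."""
    return math.ceil(16 * std_dev ** 2 / min_detectable_effect ** 2)

def days_to_run(users_needed_per_variant: int,
                daily_users_per_variant: int) -> int:
    """Days needed to reach the sample size, rounded up to whole weeks."""
    days = math.ceil(users_needed_per_variant / daily_users_per_variant)
    return math.ceil(days / 7) * 7

needed = users_per_variant(std_dev=30, min_detectable_effect=1)
print(needed, days_to_run(needed, daily_users_per_variant=1000))
```

Halving the detectable effect quadruples the required sample, which is why small practical-significance thresholds lead to long experiments.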

Before interpreting the results, we may need to run a sanity check.

Results to decisions

Case one: neither statistically nor practically significant, and the CI is not very wide. Iterate on or abandon the idea.
Case two: both statistically and practically significant. Launch.
Case three: statistically significant but not practically significant. Discuss whether the change is worth launching.
Case four: the confidence interval is very wide. Include more units to gain more power.
Case five: practically significant but not statistically significant. Repeat the experiment.
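
One possible way to encode these cases is to compare the confidence interval with zero and with a practical-significance boundary. The boundary `practical_delta`, the exact comparison rules, and the function name are assumptions for this sketch; real launch decisions also weigh cost and risk.

```python
def classify_result(estimate: float, ci_low: float, ci_high: float,
                    practical_delta: float) -> str:
    """Map an effect estimate and its CI to one of the five cases.

    practical_delta (> 0) is a hypothetical smallest change worth launching.
    """
    stat_sig = ci_low > 0 or ci_high < 0  # CI excludes zero
    inside = -practical_delta < ci_low and ci_high < practical_delta

    if stat_sig and ci_low >= practical_delta:
        return "case two: launch"
    if stat_sig and inside:
        return "case three: discuss whether it is worth it"
    if not stat_sig and inside:
        return "case one: iterate or abandon"
    if not stat_sig and estimate >= practical_delta:
        return "case five: repeat the experiment"
    if not stat_sig:
        return "case four: include more units"
    return "not covered by the five cases"

print(classify_result(estimate=1.5, ci_low=-0.2, ci_high=3.2, practical_delta=1.0))
```

Outcomes outside the five cases, such as a significant effect whose CI straddles the practical boundary, are reported as not covered rather than forced into a case.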

Chang Shen