An Introduction to Regression Discontinuity Design (RDD)

Unlocking Causal Inference: A Hands-On Guide to RDD

Last updated on Feb 6, 2025 Machine Learning

Introduction

Have you ever wondered how a government might evaluate the impact of a new scholarship program that only kicks in when a student’s GPA exceeds a specific threshold? Or how a policy change affecting tax rates for households above a certain income level actually influences economic behavior? These are just a couple of scenarios where Regression Discontinuity Design (RDD) comes into play—a powerful tool for causal inference when treatment assignment hinges on a cutoff.

In this blog post, we’ll explore RDD by building a local linear regression from the ground up—no magic packages or black-box functions here! Instead, we’ll simulate data that features a clear jump at a predetermined cutoff and then walk through the entire estimation process step by step. You’ll see exactly how the treatment effect is estimated right at the threshold and how the confidence interval is constructed to quantify uncertainty.

By starting from zero, we aim to demystify the inner workings of RDD, ensuring that you not only learn how to implement it but also understand the underlying logic behind the method. Whether you’re evaluating educational policies, economic interventions, or any scenario with a clear threshold effect, this hands-on guide is your first step toward mastering RDD.

Regression Discontinuity Design (RDD) is a quasi-experimental method used to estimate causal effects when treatment assignment is determined by a cutoff in a continuous variable (the running variable). The key idea is that units just below and just above the cutoff are very similar, so a jump in the outcome at the cutoff can be interpreted as the causal effect of the treatment.

Mathematically, let $$ m(x) = E[Y \mid X=x] $$

be the conditional expectation of the outcome Y given the running variable X. Suppose that there is a discontinuity at x = c so that: $$ m(x) = \begin{cases} m_0(x), & x < c, \newline m_1(x), & x \ge c. \end{cases} $$ The treatment effect at the cutoff is then estimated as: $$ \hat{\tau} = m_1(c) - m_0(c). $$

The variance of the estimated effect is: $$ \operatorname{Var}(\hat{\tau}) = \operatorname{Var}(m_1(c)) + \operatorname{Var}(m_0(c)) $$

and the 95% confidence interval (CI) is given by: $$ \hat{\tau} \pm z_{0.025} \sqrt{\operatorname{Var}(m_1(c)) + \operatorname{Var}(m_0(c))} $$ with $ z_{0.025} \approx 1.96$.

In the following sections, we illustrate these concepts with a simulated dataset.

1. Simulated Data

We simulate a dataset with 500 observations. The running variable x is drawn uniformly from [-10,10]. The outcome y is generated with a discontinuity (a jump) at x=0. In our simulation, for x < 0 we set: $$ y = 5 + 0.3x + \varepsilon, $$ and for $x \ge 0$: $$ y = 7 + 0.3x + \varepsilon, $$ with$\varepsilon \sim N(0,1)$.

import numpy as np
import pandas as pd

# Set the seed for reproducibility
np.random.seed(42)

# Simulate 500 observations for the running variable x
N = 500
x = np.random.uniform(-10, 10, size=N)

# Generate noise
error = np.random.normal(0, 1, size=N)

# Generate outcome y with a jump at x = 0 (jump of 2 units)
y = np.where(x < 0, 5 + 0.3 * x, 7 + 0.3 * x) + error

# Create a DataFrame
data = pd.DataFrame({'x': x, 'y': y})
print(data.head())

##           x          y
## 0 -2.509198   4.588997
## 1  9.014286  11.580457
## 2  4.639879   9.342387
## 3  1.973170   7.015047
## 4 -6.879627   2.037697

In this simulated example, the cutoff is at $x=0$.

2. Epanechnikov Kernel

The Epanechnikov kernel is used to assign weights to observations based on their distance from the cutoff. Observations closer to the cutoff are given higher weights. The kernel is defined as: $$ K(u) = \begin{cases} 0.75 (1 - u^2) & \text{if } |u| \le 1, \newline 0 & \text{otherwise.} \end{cases} $$ This topic is explained more deeply on my other blog post about Kernel Density Estimation.

def epanechnikov_kernel(u):
    """
    Epanechnikov kernel function.
    
    Returns:
    K(u) = 0.75 * (1 - u^2) for |u| <= 1, and 0 otherwise.
    """
    if abs(u) <= 1:
        return 0.75 * (1 - u**2)
    else:
        return 0

3. Local Linear Regression

To estimate the outcome function on either side of the cutoff, we perform a weighted linear regression. We center the running variable at the cutoff $c=0$ so that: $$ x_{\text{centered}} = x - c = x. $$ The local linear regression model is: $$ y = \beta_0 + \beta_1 x_{\text{centered}} + \varepsilon. $$ At $x_{\text{centered}} = 0$, the predicted outcome is $\beta_0$, which estimates m(c).

def local_linear_regression(df, y_variable, x_variable, cutoff, bandwidth, kernel_function):
    """
    Performs local linear regression on observations below and above the cutoff.
    
    The running variable is centered at the cutoff:
    $$ x_{centered} = x - c $$
    so that the intercept represents the estimated outcome at the cutoff, i.e., $$ m(c) = \\beta_0 $$.
    """
    df_local = df.copy()
    # Center the running variable at the cutoff (c = 0)
    df_local['x_centered'] = df_local[x_variable] - cutoff
    df_local['constant'] = 1

    # Compute kernel weights based on the scaled distance from the cutoff
    df_local['weights'] = df_local['x_centered'].apply(lambda x: kernel_function(x / bandwidth))
    
    # Split data into observations below and above the cutoff
    df_below = df_local[df_local['x_centered'] <= 0]
    df_above = df_local[df_local['x_centered'] > 0]
    
    def _weighted_regression(sub_df, y_var):
        if sub_df.empty:
            return np.array([np.nan, np.nan]), np.full((2, 2), np.nan)
        X = sub_df[['constant', 'x_centered']].values
        Y = sub_df[y_var].values
        W = np.diag(sub_df['weights'].values)
        try:
            XTWX_inv = np.linalg.inv(X.T @ W @ X)
            theta = XTWX_inv @ (X.T @ W @ Y)
            residuals = Y - X @ theta
            variance_matrix = XTWX_inv @ (X.T @ W @ np.diag(residuals**2) @ W @ X) @ XTWX_inv
        except np.linalg.LinAlgError:
            return np.array([np.nan, np.nan]), np.full((2, 2), np.nan)
        return theta, variance_matrix
    
    theta_below, var_below = _weighted_regression(df_below, y_variable)
    theta_above, var_above = _weighted_regression(df_above, y_variable)
    return theta_below, var_below, theta_above, var_above

RDD Estimator

The RDD estimator computes the treatment effect as the difference in the predicted outcomes at the cutoff: $$ \hat{\tau} = \hat{m}_1(0) - \hat{m}_0(0), $$ where $\hat{m}_1(0)$ and $\hat{m}_0(0)$ are the intercepts from the regressions above and below the cutoff, respectively. The variance of the treatment effect is: $$ \operatorname{Var}(\hat{\tau}) = \operatorname{Var}(\hat{m}_1(0)) + \operatorname{Var}(\hat{m}_0(0)), $$ and its standard error is the square root of this sum.

def rdd_estimator(df, y_variable, x_variable, cutoff, bandwidth, kernel_function):
    """
    Estimates the RDD treatment effect as the difference in the predicted outcome at the cutoff.
    
    Since the running variable is centered at the cutoff,
    the predicted outcome is simply the intercept:
    $$ \\hat{m}(c) = \\beta_0. $$
    Thus, the treatment effect is:
    $$ \\hat{\\tau} = \\hat{m}_1(c) - \\hat{m}_0(c). $$
    """
    theta_below, var_below, theta_above, var_above = local_linear_regression(
        df, y_variable, x_variable, cutoff, bandwidth, kernel_function)
    
    if np.any(np.isnan(theta_below)) or np.any(np.isnan(theta_above)):
         return np.nan, np.nan
    
    # Predicted outcome at the cutoff is the intercept (when x_centered = 0)
    pred_below = theta_below[0]
    pred_above = theta_above[0]
    treatment_effect = pred_above - pred_below
    
    # Variance of the intercept estimates (using the selector vector e0 = [1, 0])
    e0 = np.array([1.0, 0.0])
    se = np.sqrt(e0 @ (var_below + var_above) @ e0)
    
    return treatment_effect, se

Running the Analysis and Constructing the Confidence Interval

We now run the RDD estimator using a chosen bandwidth and compute the 95% confidence interval. Recall, the 95% CI is given by: $$ CI_\text{95%} = \hat{\tau} \pm 1.96 \times SE(\hat{\tau}). $$

from scipy.stats import norm
# Choose a bandwidth (experiment with this value)
bandwidth = 2.0

# Run the RDD estimator on the simulated data. Here, the cutoff is 0.
treatment_effect, se = rdd_estimator(data, 'y', 'x', cutoff=0, bandwidth=bandwidth, kernel_function=epanechnikov_kernel)

if not np.isnan(treatment_effect):
    print("RDD Estimate (ATE): {:.4f}".format(treatment_effect))
    print("Standard Error: {:.4f}".format(se))
    
    # Compute the 95% confidence interval.
    alpha = 0.05
    z = norm.ppf(1 - alpha/2)  # Approximately 1.96 for a 95% CI.
    ci_lower = treatment_effect - z * se
    ci_upper = treatment_effect + z * se
    print("95% Confidence Interval: [{:.4f}, {:.4f}]".format(ci_lower, ci_upper))
else:
    print("RDD estimation failed. Please check your data or bandwidth.")

## RDD Estimate (ATE): 1.9539
## Standard Error: 0.3026
## 95% Confidence Interval: [1.3607, 2.5470]

6. Graphical Representation: Plotting the Local Linear Fits and Confidence Intervals

Below we plot the estimated local linear regressions on either side of the cutoff, along with their pointwise 95% confidence intervals.

import matplotlib.pyplot as plt

def plot_rdd_fit(df, y_variable, x_variable, cutoff, bandwidth, kernel_function, n_grid=100, alpha=0.05):
    """
    Plots the local linear regression fits with pointwise 95% confidence intervals.
    
    The regression is estimated separately on each side of the cutoff.
    """
    # Get the estimates from local linear regressions.
    theta_below, var_below, theta_above, var_above = local_linear_regression(df, y_variable, x_variable, cutoff, bandwidth, kernel_function)
    
    if np.any(np.isnan(theta_below)) or np.any(np.isnan(theta_above)):
        print("Local-linear regression failed on one side. No plot produced.")
        return
    
    # Extract coefficients for each side.
    beta0_below, beta1_below = theta_below
    beta0_above, beta1_above = theta_above
    
    # Create grids for values below and above the cutoff.
    x_min, x_max = df[x_variable].min(), df[x_variable].max()
    grid_below = np.linspace(x_min, cutoff, n_grid)
    grid_above = np.linspace(cutoff, x_max, n_grid)
    
    # Compute the z-value for the (1-alpha) CI.
    z_val = norm.ppf(1 - alpha/2)
    
    # Compute predictions and confidence intervals for observations below the cutoff.
    mhat_below = []
    ci_lower_below = []
    ci_upper_below = []
    for xg in grid_below:
        # Prediction: m0(x) = beta0_below + beta1_below * (x - cutoff)
        pred = beta0_below + beta1_below * (xg - cutoff)
        vec = np.array([1.0, xg - cutoff])
        var_pred = vec @ var_below @ vec
        se_pred = np.sqrt(var_pred)
        mhat_below.append(pred)
        ci_lower_below.append(pred - z_val * se_pred)
        ci_upper_below.append(pred + z_val * se_pred)
        
    # Compute predictions and confidence intervals for observations above the cutoff.
    mhat_above = []
    ci_lower_above = []
    ci_upper_above = []
    for xg in grid_above:
        # Prediction: m1(x) = beta0_above + beta1_above * (x - cutoff)
        pred = beta0_above + beta1_above * (xg - cutoff)
        vec = np.array([1.0, xg - cutoff])
        var_pred = vec @ var_above @ vec
        se_pred = np.sqrt(var_pred)
        mhat_above.append(pred)
        ci_lower_above.append(pred - z_val * se_pred)
        ci_upper_above.append(pred + z_val * se_pred)
    
    #Lets create the original lines from the formula
    y_below = 5 + 0.3 * grid_below
    y_above = 7 + 0.3 * grid_above
    
    # Plot the results.
    plt.figure(figsize=(8,6))
    plt.plot(grid_below, mhat_below, color='blue', label=r'$\hat{m}_0(x)$')
    plt.fill_between(grid_below, ci_lower_below, ci_upper_below, color='blue', alpha=0.3)
    plt.plot(grid_above, mhat_above, color='red', label=r'$\hat{m}_1(x)$')
    plt.fill_between(grid_above, ci_lower_above, ci_upper_above, color='red', alpha=0.3)
    plt.axvline(cutoff, color='black', linestyle='--', label='Cutoff')
    plt.xlabel('Running Variable')
    plt.ylabel('Outcome')
    plt.title('Local Linear Regression Fit with 95% Confidence Intervals')
    #Lets also add the original data
    plt.scatter(df[x_variable], df[y_variable], color='black', alpha=0.15, label='Data')
    #Lets plot the original lines
    plt.plot(grid_below, y_below, color='green', label='y = 5 + 0.3x',linestyle='-.')
    plt.plot(grid_above, y_above, color='purple', label='y = 7 + 0.3x',linestyle='-.')
    plt.legend()
    plt.show()

# Call the plotting function.
plot_rdd_fit(data, 'y', 'x', cutoff=0, bandwidth=bandwidth, kernel_function=epanechnikov_kernel)

As we can see, the local linear regression effectively emphasizes the data closest to the cutoff by using weighted observations. While the slope estimates may be more variable—since they rely on data closer to the cutoff, the intercept provides a robust estimate of the treatment effect at the cutoff. This is why the confidence interval is narrow near the cutoff (where the model is most precise) and widens as we move away, reflecting that our inference is primarily focused on the discontinuity point. Overall, the model accurately captures the differences in outcomes between the two sides of the cutoff.

Conclusion

In this post we have:

Simulated data with a discontinuity at $x=0$,
Used an Epanechnikov kernel to weight observations based on their proximity to the cutoff,
Performed local linear regressions on each side of the cutoff to estimate the outcome functions, and
Estimated the treatment effect as the jump at the cutoff, along with its 95% confidence interval.

This example provides a didactic introduction to RDD. By focusing on data near the cutoff, RDD isolates the causal effect of the treatment. Happy coding and exploring RDD!

RD Machine_Learning