API Reference for PPI

Documentation for functions implementing prediction-powered inference can be found here.

ppi_py.ppi_mean_pointestimate(Y, Yhat, Yhat_unlabeled, lam=None, coord=None, w=None, w_unlabeled=None, lam_optim_mode='overall')

Computes the prediction-powered point estimate of the d-dimensional mean.

Parameters:
  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical point estimate.

  • coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.

  • w (ndarray, optional) – Sample weights for the labeled data set. Defaults to a vector of all ones.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set. Defaults to a vector of all ones.

Returns:

Prediction-powered point estimate of the mean.

Return type:

float or ndarray

Notes

[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
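The roles of Y, Yhat, Yhat_unlabeled and lam described for ppi_mean_pointestimate can be illustrated with a short NumPy sketch. It is not the ppi_py implementation: the function name ppi_mean_sketch and the synthetic data are assumptions made for illustration, sample weights are omitted, and lam is fixed by the caller rather than estimated from data.

```python
import numpy as np

def ppi_mean_sketch(Y, Yhat, Yhat_unlabeled, lam=1.0):
    """Illustrative PPI mean estimate (hypothetical helper, not ppi_py code)."""
    Y = np.asarray(Y, dtype=float)
    Yhat = np.asarray(Yhat, dtype=float)
    Yhat_unlabeled = np.asarray(Yhat_unlabeled, dtype=float)
    # Mean of the lam-scaled predictions on the unlabeled data.
    imputed = lam * Yhat_unlabeled.mean(axis=0)
    # Correction term: average of Y - lam * Yhat on the labeled data.
    rectifier = (Y - lam * Yhat).mean(axis=0)
    return imputed + rectifier

# Synthetic data (assumed): noisy predictions of labels with mean 2.
rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, size=200)
Yhat = Y + rng.normal(scale=0.5, size=200)
Yhat_unlabeled = rng.normal(loc=2.0, size=5000) + rng.normal(scale=0.5, size=5000)

classical = ppi_mean_sketch(Y, Yhat, Yhat_unlabeled, lam=0.0)  # sample mean of Y
ppi = ppi_mean_sketch(Y, Yhat, Yhat_unlabeled, lam=1.0)        # PPI, no power tuning
```

With lam=0 the unlabeled predictions drop out and the result is the sample mean of Y, matching the description of the lam parameter.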

ppi_py.ppi_mean_ci(Y, Yhat, Yhat_unlabeled, alpha=0.1, alternative='two-sided', lam=None, coord=None, w=None, w_unlabeled=None, lam_optim_mode='overall')

Computes the prediction-powered confidence interval for a d-dimensional mean.

Parameters:
  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in (0, 1).

  • alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT interval.

  • coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.

  • w (ndarray, optional) – Sample weights for the labeled data set.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.

Returns:

Lower and upper bounds of the prediction-powered confidence interval for the mean.

Return type:

tuple

Notes

[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
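A minimal sketch of how a two-sided interval of this kind can be formed for a 1D mean, assuming a normal approximation with one variance term per data set. The helper name ppi_mean_ci_sketch and the data are hypothetical; sample weights, one-sided alternatives and the data-driven choice of lam are left out, so this is an illustration rather than the library's code.

```python
from statistics import NormalDist
import numpy as np

def ppi_mean_ci_sketch(Y, Yhat, Yhat_unlabeled, alpha=0.1, lam=1.0):
    """Illustrative two-sided CLT interval for a 1D mean (not ppi_py code)."""
    Y, Yhat, Yhat_unlabeled = (
        np.asarray(a, dtype=float) for a in (Y, Yhat, Yhat_unlabeled)
    )
    n, N = len(Y), len(Yhat_unlabeled)
    point = lam * Yhat_unlabeled.mean() + (Y - lam * Yhat).mean()
    # Variance contributions of the unlabeled and the labeled term.
    var_unlabeled = np.var(lam * Yhat_unlabeled, ddof=1) / N
    var_labeled = np.var(Y - lam * Yhat, ddof=1) / n
    se = np.sqrt(var_unlabeled + var_labeled)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    return point - z * se, point + z * se

# Synthetic data (assumed): accurate predictions and a large unlabeled set.
rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, size=200)
Yhat = Y + rng.normal(scale=0.5, size=200)
Yhat_unlabeled = rng.normal(loc=2.0, size=5000) + rng.normal(scale=0.5, size=5000)

ci_classical = ppi_mean_ci_sketch(Y, Yhat, Yhat_unlabeled, lam=0.0)
ci_ppi = ppi_mean_ci_sketch(Y, Yhat, Yhat_unlabeled, lam=1.0)
```

In this synthetic setting the predictions are accurate and the unlabeled set is large, so the lam=1 interval comes out narrower than the classical lam=0 interval.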

ppi_py.ppi_mean_pval(Y, Yhat, Yhat_unlabeled, null=0, alternative='two-sided', lam=None, coord=None, w=None, w_unlabeled=None, lam_optim_mode='overall')

Computes the prediction-powered p-value for a 1D mean.

Parameters:
  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • null (float, optional) – Value of the null hypothesis to be tested. Defaults to 0.

  • alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT-based p-value.

  • coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.

  • w (ndarray, optional) – Sample weights for the labeled data set.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.

Returns:

Prediction-powered p-value for the mean.

Return type:

float or ndarray

Notes

[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
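The two-sided p-value for a 1D mean can be sketched by standardizing the distance between the point estimate and the null value. The name ppi_mean_pval_sketch is hypothetical, the normal approximation is an assumption of this sketch, and only the 'two-sided' alternative is shown.

```python
from statistics import NormalDist
import numpy as np

def ppi_mean_pval_sketch(Y, Yhat, Yhat_unlabeled, null=0.0, lam=1.0):
    """Illustrative two-sided p-value for a 1D mean (not ppi_py code)."""
    Y, Yhat, Yhat_unlabeled = (
        np.asarray(a, dtype=float) for a in (Y, Yhat, Yhat_unlabeled)
    )
    n, N = len(Y), len(Yhat_unlabeled)
    point = lam * Yhat_unlabeled.mean() + (Y - lam * Yhat).mean()
    se = np.sqrt(
        np.var(lam * Yhat_unlabeled, ddof=1) / N
        + np.var(Y - lam * Yhat, ddof=1) / n
    )
    z = (point - null) / se
    # Two-sided tail probability under the standard normal distribution.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Synthetic data (assumed) with true mean 2.
rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, size=200)
Yhat = Y + rng.normal(scale=0.5, size=200)
Yhat_unlabeled = rng.normal(loc=2.0, size=5000) + rng.normal(scale=0.5, size=5000)

p_far = ppi_mean_pval_sketch(Y, Yhat, Yhat_unlabeled, null=0.0)  # null far from the truth
p_near = ppi_mean_pval_sketch(Y, Yhat, Yhat_unlabeled, null=2.0)  # null equal to the truth
```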

ppi_py.ppi_quantile_pointestimate(Y, Yhat, Yhat_unlabeled, q, exact_grid=False, w=None, w_unlabeled=None)

Computes the prediction-powered point estimate of the quantile.

Parameters:
  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • q (float) – Quantile to estimate. Must be in the range (0, 1).

  • exact_grid (bool, optional) – Whether to compute the exact solution (True) or an approximate solution based on a linearly spaced grid of 5000 values (False).

  • w (ndarray, optional) – Sample weights for the labeled data set.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.

Returns:

Prediction-powered point estimate of the quantile.

Return type:

float
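The grid-based approximation mentioned for exact_grid=False can be sketched as follows: evaluate a prediction-corrected empirical CDF on a linearly spaced grid and return the grid point whose CDF value is closest to q. The helper names and the grid bounds (the range of all observed values) are assumptions of this sketch, and sample weights are omitted.

```python
import numpy as np

def ecdf(sample, grid):
    """Fraction of sample values <= each grid point."""
    return np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

def ppi_quantile_sketch(Y, Yhat, Yhat_unlabeled, q, n_grid=5000):
    """Illustrative grid-based PPI quantile estimate (not ppi_py code)."""
    Y, Yhat, Yhat_unlabeled = (
        np.asarray(a, dtype=float) for a in (Y, Yhat, Yhat_unlabeled)
    )
    everything = np.concatenate([Y, Yhat, Yhat_unlabeled])
    grid = np.linspace(everything.min(), everything.max(), n_grid)
    # CDF of the unlabeled predictions, corrected by the labeled CDF gap.
    rectified_cdf = ecdf(Yhat_unlabeled, grid) + ecdf(Y, grid) - ecdf(Yhat, grid)
    return grid[np.argmin(np.abs(rectified_cdf - q))]

# Synthetic data (assumed) with true median 2.
rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, size=200)
Yhat = Y + rng.normal(scale=0.5, size=200)
Yhat_unlabeled = rng.normal(loc=2.0, size=5000) + rng.normal(scale=0.5, size=5000)

median_estimate = ppi_quantile_sketch(Y, Yhat, Yhat_unlabeled, q=0.5)
```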

ppi_py.ppi_quantile_ci(Y, Yhat, Yhat_unlabeled, q, alpha=0.1, exact_grid=False, w=None, w_unlabeled=None)

Computes the prediction-powered confidence interval for the quantile.

Parameters:
  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • q (float) – Quantile to estimate. Must be in the range (0, 1).

  • alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in the range (0, 1).

  • exact_grid (bool, optional) – Whether to use the exact grid of values or a linearly spaced grid of 5000 values.

  • w (ndarray, optional) – Sample weights for the labeled data set.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.

Returns:

Lower and upper bounds of the prediction-powered confidence interval for the quantile.

Return type:

tuple
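One way to picture a quantile interval of this kind is to scan a grid of candidate values and keep every candidate whose prediction-corrected CDF is statistically compatible with q. The sketch below does this with a per-candidate normal approximation. The function name, the grid bounds and the omission of sample weights are assumptions, and the library may construct its interval differently.

```python
from statistics import NormalDist
import numpy as np

def ppi_quantile_ci_sketch(Y, Yhat, Yhat_unlabeled, q, alpha=0.1, n_grid=5000):
    """Illustrative grid-scan interval for a quantile (not ppi_py code)."""
    Y, Yhat, Yhat_unlabeled = (
        np.asarray(a, dtype=float) for a in (Y, Yhat, Yhat_unlabeled)
    )
    n, N = len(Y), len(Yhat_unlabeled)
    everything = np.concatenate([Y, Yhat, Yhat_unlabeled])
    grid = np.linspace(everything.min(), everything.max(), n_grid)
    # CDF of the unlabeled predictions at each candidate value.
    cdf_unl = np.searchsorted(np.sort(Yhat_unlabeled), grid, side="right") / N
    # Per-sample correction: 1{Y <= t} - 1{Yhat <= t}, shape (n, n_grid).
    rect = (Y[:, None] <= grid).astype(float) - (Yhat[:, None] <= grid).astype(float)
    est = cdf_unl + rect.mean(axis=0)
    se = np.sqrt(cdf_unl * (1 - cdf_unl) / N + rect.var(axis=0, ddof=1) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Keep candidates whose corrected CDF is within z standard errors of q.
    keep = np.abs(est - q) <= z * se
    return grid[keep].min(), grid[keep].max()

# Synthetic data (assumed) with true median 2.
rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, size=200)
Yhat = Y + rng.normal(scale=0.5, size=200)
Yhat_unlabeled = rng.normal(loc=2.0, size=5000) + rng.normal(scale=0.5, size=5000)

lower, upper = ppi_quantile_ci_sketch(Y, Yhat, Yhat_unlabeled, q=0.5)
```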

ppi_py.ppi_ols_pointestimate(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=None, coord=None, w=None, w_unlabeled=None)

Computes the prediction-powered point estimate of the OLS coefficients.

Parameters:
  • X (ndarray) – Covariates corresponding to the gold-standard labels.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical point estimate.

  • coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.

  • w (ndarray, optional) – Sample weights for the labeled data set.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.

Returns:

Prediction-powered point estimate of the OLS coefficients.

Return type:

ndarray

Notes

[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
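For OLS, one common way to combine the two data sets is to fit least squares on the lam-scaled unlabeled predictions and add a least-squares fit of the labeled residuals Y - lam * Yhat. The sketch below uses that form as an assumption. It is not taken from the ppi_py source, the name ppi_ols_sketch and the data are hypothetical, and sample weights are omitted.

```python
import numpy as np

def ppi_ols_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=1.0):
    """Illustrative PPI-style OLS coefficients (not ppi_py code)."""
    def ols(A, b):
        # Ordinary least-squares solution of A @ theta ~ b.
        return np.linalg.lstsq(A, b, rcond=None)[0]
    imputed = ols(X_unlabeled, lam * Yhat_unlabeled)  # fit on predictions
    rectifier = ols(X, Y - lam * Yhat)                # fit on labeled residuals
    return imputed + rectifier

# Synthetic linear model (assumed) with coefficients [1, -2].
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])
X = rng.normal(size=(300, 2))
Y = X @ beta + rng.normal(size=300)
Yhat = Y + rng.normal(scale=0.5, size=300)
X_unlabeled = rng.normal(size=(5000, 2))
Yhat_unlabeled = (
    X_unlabeled @ beta + rng.normal(size=5000) + rng.normal(scale=0.5, size=5000)
)

theta_classical = ppi_ols_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=0.0)
theta_ppi = ppi_ols_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=1.0)
```

With lam=0 the first term vanishes and the result is the ordinary least-squares fit on the labeled data alone.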

ppi_py.ppi_ols_ci(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, alpha=0.1, alternative='two-sided', lam=None, coord=None, w=None, w_unlabeled=None)

Computes the prediction-powered confidence interval for the OLS coefficients using the PPI++ algorithm from [ADZ23].

Parameters:
  • X (ndarray) – Covariates corresponding to the gold-standard labels.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in the range (0, 1).

  • alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT interval.

  • coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.

  • w (ndarray, optional) – Sample weights for the labeled data set.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.

Returns:

Lower and upper bounds of the prediction-powered confidence interval for the OLS coefficients.

Return type:

tuple

Notes

[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.

ppi_py.ppi_logistic_pointestimate(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=None, coord=None, optimizer_options=None, w=None, w_unlabeled=None)

Computes the prediction-powered point estimate of the logistic regression coefficients.

Parameters:
  • X (ndarray) – Covariates corresponding to the gold-standard labels.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical point estimate.

  • coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.

  • optimizer_options (dict, optional) – Options to pass to the optimizer. See scipy.optimize.minimize for details.

  • w (ndarray, optional) – Sample weights for the labeled data set.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.

Returns:

Prediction-powered point estimate of the logistic regression coefficients.

Return type:

ndarray

Notes

[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
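The logistic estimate has no closed form, so it is obtained by numerical optimization. The sketch below minimizes a lam-weighted combination of logistic losses (unlabeled predictions, plus labeled data, minus labeled predictions) with plain Newton iterations in NumPy. Newton's method is swapped in here for illustration in place of scipy.optimize.minimize. The function names, the fixed iteration count and the synthetic data are assumptions, and sample weights are omitted.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ppi_logistic_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=1.0, n_iter=50):
    """Illustrative PPI-style logistic fit via Newton steps (not ppi_py code)."""
    n, N = len(Y), len(Yhat_unlabeled)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        p_unl = sigmoid(X_unlabeled @ theta)
        # Gradient of: lam * loss(unlabeled preds) + loss(Y) - lam * loss(Yhat).
        grad = (
            lam * X_unlabeled.T @ (p_unl - Yhat_unlabeled) / N
            + X.T @ (p - Y) / n
            - lam * X.T @ (p - Yhat) / n
        )
        # Hessian of the same objective; positive semidefinite for lam in [0, 1].
        hess = (
            lam * (X_unlabeled.T * (p_unl * (1 - p_unl))) @ X_unlabeled / N
            + (1 - lam) * (X.T * (p * (1 - p))) @ X / n
        )
        theta = theta - np.linalg.solve(hess, grad)
    return theta

# Synthetic data (assumed): binary labels, predictions flip 10% of labels.
rng = np.random.default_rng(0)
beta = np.array([1.0, -1.0])
X = rng.normal(size=(500, 2))
Y = (rng.random(500) < sigmoid(X @ beta)).astype(float)
Yhat = np.where(rng.random(500) < 0.1, 1 - Y, Y)
X_unlabeled = rng.normal(size=(5000, 2))
Y_hidden = (rng.random(5000) < sigmoid(X_unlabeled @ beta)).astype(float)
Yhat_unlabeled = np.where(rng.random(5000) < 0.1, 1 - Y_hidden, Y_hidden)

theta_classical = ppi_logistic_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=0.0)
theta_ppi = ppi_logistic_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=1.0)
```

With lam=0 the objective reduces to the ordinary logistic loss on the labeled data, so theta_classical solves the usual score equation.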

ppi_py.ppi_logistic_ci(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, alpha=0.1, alternative='two-sided', lam=None, coord=None, optimizer_options=None, w=None, w_unlabeled=None)

Computes the prediction-powered confidence interval for the logistic regression coefficients using the PPI++ algorithm from [ADZ23].

Parameters:
  • X (ndarray) – Covariates corresponding to the gold-standard labels.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in the range (0, 1).

  • alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT interval.

  • coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.

  • optimizer_options (dict, optional) – Options to pass to the optimizer. See scipy.optimize.minimize for details.

  • w (ndarray, optional) – Sample weights for the labeled data set. If None, all weights are set to 1.

  • w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set. If None, all weights are set to 1.

Returns:

Lower and upper bounds of the prediction-powered confidence interval for the logistic regression coefficients.

Return type:

tuple

Notes

[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.

ppi_py.ppboot(estimator, Y, Yhat, Yhat_unlabeled, X=None, X_unlabeled=None, lam=None, n_resamples=1000, n_resamples_lam=50, alpha=0.1, alternative='two-sided', method='percentile')

Computes the prediction-powered bootstrap confidence interval for the estimator.

Parameters:
  • estimator (callable) – Estimator function. Takes in (X,Y) or (Y) and returns a point estimate.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • X (ndarray, optional) – Covariates corresponding to the gold-standard labels. Defaults to None. If None, the estimator is assumed to only take in Y.

  • X_unlabeled (ndarray, optional) – Covariates corresponding to the unlabeled data. Defaults to None. If None, the estimator is assumed to only take in Y. If X is not None, X_unlabeled must also be provided, and vice versa.

  • lam (float, optional) – Power-tuning parameter (see [ADZ23] in addition to [Z24]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPBoot with no power tuning, and setting lam=0 recovers the classical bootstrap interval.

  • n_resamples (int, optional) – Number of bootstrap resamples. Defaults to 1000.

  • n_resamples_lam (int, optional) – Number of bootstrap resamples for the power-tuning parameter. Defaults to 50.

  • alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in (0, 1). Defaults to 0.1.

  • alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’. Defaults to ‘two-sided’.

  • method (str, optional) – Method to compute the confidence interval, either ‘percentile’ or ‘basic’. Defaults to ‘percentile’.

Returns:

Lower and upper bounds of the prediction-powered bootstrap confidence interval for the estimator.

Return type:

tuple

Notes

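[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.

[Z24] T. Zrnic. A Note on the Prediction-Powered Bootstrap. arxiv:2405.18379, 2024.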
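The resampling scheme behind a prediction-powered bootstrap interval can be sketched for an estimator that takes only Y. Each replicate resamples the labeled pairs and the unlabeled predictions with replacement, then combines three evaluations of the estimator. The sketch covers only the 'percentile' method with a fixed lam. The name ppboot_sketch, the seed argument and the data are assumptions, not part of the ppi_py API.

```python
import numpy as np

def ppboot_sketch(estimator, Y, Yhat, Yhat_unlabeled, lam=1.0,
                  n_resamples=1000, alpha=0.1, seed=0):
    """Illustrative percentile bootstrap interval (not ppi_py code)."""
    rng = np.random.default_rng(seed)
    n, N = len(Y), len(Yhat_unlabeled)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)      # labeled pairs, resampled jointly
        idx_unl = rng.integers(0, N, size=N)  # unlabeled predictions
        stats[b] = (
            lam * estimator(Yhat_unlabeled[idx_unl])
            + estimator(Y[idx])
            - lam * estimator(Yhat[idx])
        )
    # Percentile method: empirical quantiles of the bootstrap replicates.
    return np.quantile(stats, alpha / 2), np.quantile(stats, 1 - alpha / 2)

# Synthetic data (assumed) with true median 2; the estimator is the median.
rng = np.random.default_rng(1)
Y = rng.normal(loc=2.0, size=200)
Yhat = Y + rng.normal(scale=0.5, size=200)
Yhat_unlabeled = rng.normal(loc=2.0, size=5000) + rng.normal(scale=0.5, size=5000)

lower, upper = ppboot_sketch(np.median, Y, Yhat, Yhat_unlabeled, n_resamples=500)
```

Any callable that maps a 1D array to a scalar can be passed as the estimator in this sketch, which is the appeal of the bootstrap approach when no closed-form variance is available.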

ppi_py.ppi_distribution_label_shift_ci(Y, Yhat, Yhat_unlabeled, K, nu, alpha=0.1, delta=None, return_counts=True)

Computes the prediction-powered confidence interval for nu^T f for a discrete distribution f, under label shift.

Parameters:
  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.

  • K (int) – Number of classes.

  • nu (ndarray) – Vector nu. Coordinates must be bounded within [0, 1].

  • alpha (float, optional) – Final error level; the confidence interval will target a coverage of 1 - alpha. Must be in (0, 1).

  • delta (float, optional) – Error level of the intermediate confidence interval for the mean. Must be in (0, alpha). If return_counts == False, then delta is set equal to alpha and ignored.

  • return_counts (bool, optional) – Whether to return the number of samples in each class as opposed to the mean.

Returns:

Lower and upper bounds of the prediction-powered confidence interval for nu^T f for a discrete distribution f, under label shift.

Return type:

tuple