API Reference for PPI
Documentation for functions implementing prediction-powered inference can be found here.
- ppi_py.ppi_mean_pointestimate(Y, Yhat, Yhat_unlabeled, lam=None, coord=None, w=None, w_unlabeled=None, lam_optim_mode='overall')
Computes the prediction-powered point estimate of the d-dimensional mean.
- Parameters:
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical point estimate.
coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.
w (ndarray, optional) – Sample weights for the labeled data set. Defaults to all ones vector.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set. Defaults to all ones vector.
- Returns:
Prediction-powered point estimate of the mean.
- Return type:
float or ndarray
Notes
[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
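As a rough illustration of the role of lam described above, the following sketch computes a power-tuned estimate of a 1D mean as lam * mean(Yhat_unlabeled) + mean(Y - lam * Yhat). It is a minimal standard-library sketch, not the library's implementation: the helper name ppi_mean_sketch and the synthetic data are hypothetical, and sample weights and automatic selection of lam are left out.

```python
import random
from statistics import fmean


def ppi_mean_sketch(Y, Yhat, Yhat_unlabeled, lam=1.0):
    """Hypothetical helper: lam * mean(Yhat_unlabeled) + mean(Y - lam * Yhat), 1D only."""
    # The rectifier measures how far the (scaled) predictions are from the labels.
    rectifier = fmean(y - lam * f for y, f in zip(Y, Yhat))
    return lam * fmean(Yhat_unlabeled) + rectifier


# Synthetic data (an assumption for illustration): true mean 1.0,
# predictions are biased by +0.2 and noisy.
rng = random.Random(0)
Y = [rng.gauss(1.0, 1.0) for _ in range(200)]
Yhat = [y + rng.gauss(0.2, 0.5) for y in Y]
Yhat_unlabeled = [rng.gauss(1.0, 1.0) + rng.gauss(0.2, 0.5) for _ in range(5000)]

classical = ppi_mean_sketch(Y, Yhat, Yhat_unlabeled, lam=0.0)  # equals mean(Y)
ppi = ppi_mean_sketch(Y, Yhat, Yhat_unlabeled, lam=1.0)        # no power tuning
print(classical, ppi)
```

With lam=0 the unlabeled predictions drop out and the sketch reduces to the sample mean of Y, matching the classical point estimate mentioned in the parameter description.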
- ppi_py.ppi_mean_ci(Y, Yhat, Yhat_unlabeled, alpha=0.1, alternative='two-sided', lam=None, coord=None, w=None, w_unlabeled=None, lam_optim_mode='overall')
Computes the prediction-powered confidence interval for a d-dimensional mean.
- Parameters:
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in (0, 1).
alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.
lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT interval.
coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.
w (ndarray, optional) – Sample weights for the labeled data set.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.
- Returns:
Lower and upper bounds of the prediction-powered confidence interval for the mean.
- Return type:
tuple
Notes
[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
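To make the interval concrete, here is a minimal sketch of a two-sided CLT interval for a 1D mean at a fixed lam. It assumes the variance of the estimate is lam^2 * Var(Yhat_unlabeled) / N + Var(Y - lam * Yhat) / n, with the labeled and unlabeled samples independent. The helper name ppi_mean_ci_sketch and the data are hypothetical; weights, one-sided alternatives, and the data-driven choice of lam are not covered.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_ci_sketch(Y, Yhat, Yhat_unlabeled, alpha=0.1, lam=1.0):
    """Hypothetical helper: two-sided CLT interval for a 1D mean at fixed lam."""
    n, N = len(Y), len(Yhat_unlabeled)
    rectifier = [y - lam * f for y, f in zip(Y, Yhat)]
    estimate = lam * fmean(Yhat_unlabeled) + fmean(rectifier)
    # Standard error: imputed part plus rectifier part (samples assumed independent).
    se = sqrt(lam**2 * variance(Yhat_unlabeled) / N + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se


# Synthetic data (an assumption for illustration).
rng = random.Random(1)
Y = [rng.gauss(1.0, 1.0) for _ in range(200)]
Yhat = [y + rng.gauss(0.2, 0.5) for y in Y]
Yhat_unlabeled = [rng.gauss(1.0, 1.0) + rng.gauss(0.2, 0.5) for _ in range(5000)]

ppi_lo, ppi_hi = ppi_mean_ci_sketch(Y, Yhat, Yhat_unlabeled, alpha=0.1, lam=1.0)
cls_lo, cls_hi = ppi_mean_ci_sketch(Y, Yhat, Yhat_unlabeled, alpha=0.1, lam=0.0)
print((ppi_lo, ppi_hi), (cls_lo, cls_hi))
```

When the predictions track the labels closely, as in this synthetic setup, the lam=1 interval tends to be narrower than the lam=0 (classical) one.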
- ppi_py.ppi_mean_pval(Y, Yhat, Yhat_unlabeled, null=0, alternative='two-sided', lam=None, coord=None, w=None, w_unlabeled=None, lam_optim_mode='overall')
Computes the prediction-powered p-value for a 1D mean.
- Parameters:
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
null (float, optional) – Value of the null hypothesis to be tested. Defaults to 0.
alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.
lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT p-value.
coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.
w (ndarray, optional) – Sample weights for the labeled data set.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.
- Returns:
Prediction-powered p-value for the mean.
- Return type:
float or ndarray
Notes
[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
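The p-value can be read as the tail probability of a z-statistic built from the same estimate and standard error as the CLT interval. The sketch below covers only the two-sided case at a fixed lam. The helper name ppi_mean_pval_sketch and the data are hypothetical, and the variance formula is the same assumption as for the interval (independent labeled and unlabeled samples, no weights).

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_pval_sketch(Y, Yhat, Yhat_unlabeled, null=0.0, lam=1.0):
    """Hypothetical helper: two-sided CLT p-value for H0: mean == null, 1D only."""
    n, N = len(Y), len(Yhat_unlabeled)
    rectifier = [y - lam * f for y, f in zip(Y, Yhat)]
    estimate = lam * fmean(Yhat_unlabeled) + fmean(rectifier)
    se = sqrt(lam**2 * variance(Yhat_unlabeled) / N + variance(rectifier) / n)
    z = (estimate - null) / se
    # Two-sided tail probability under the standard normal.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Synthetic data (an assumption for illustration): true mean 1.0.
rng = random.Random(2)
Y = [rng.gauss(1.0, 1.0) for _ in range(200)]
Yhat = [y + rng.gauss(0.2, 0.5) for y in Y]
Yhat_unlabeled = [rng.gauss(1.0, 1.0) + rng.gauss(0.2, 0.5) for _ in range(5000)]

p_true_null = ppi_mean_pval_sketch(Y, Yhat, Yhat_unlabeled, null=1.0)
p_false_null = ppi_mean_pval_sketch(Y, Yhat, Yhat_unlabeled, null=0.0)
print(p_true_null, p_false_null)
```

A null far from the true mean (here 0.0) should give a p-value close to zero, while the true mean gives an unremarkable one.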
- ppi_py.ppi_quantile_pointestimate(Y, Yhat, Yhat_unlabeled, q, exact_grid=False, w=None, w_unlabeled=None)
Computes the prediction-powered point estimate of the quantile.
- Parameters:
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
q (float) – Quantile to estimate. Must be in the range (0, 1).
exact_grid (bool, optional) – Whether to compute the exact solution (True) or an approximate solution based on a linearly spaced grid of 5000 values (False).
w (ndarray, optional) – Sample weights for the labeled data set.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.
- Returns:
Prediction-powered point estimate of the quantile.
- Return type:
float
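The grid-based approximation can be pictured as follows: evaluate a rectified empirical CDF on a linearly spaced grid and return the grid point whose CDF value is closest to q. The sketch below assumes the rectified CDF is F_unlabeled(t) + F_Y(t) - F_Yhat(t), built from indicator averages. The names ppi_quantile_sketch and ecdf are hypothetical, weights are ignored, and the library's grid construction may differ.

```python
import random
from bisect import bisect_right


def ecdf(values):
    """Return a function t -> fraction of values <= t."""
    s = sorted(values)
    return lambda t: bisect_right(s, t) / len(s)


def ppi_quantile_sketch(Y, Yhat, Yhat_unlabeled, q, grid_size=5000):
    """Hypothetical helper: grid point whose rectified CDF is closest to q."""
    lo = min(min(Y), min(Yhat), min(Yhat_unlabeled))
    hi = max(max(Y), max(Yhat), max(Yhat_unlabeled))
    F_unl, F_y, F_yhat = ecdf(Yhat_unlabeled), ecdf(Y), ecdf(Yhat)
    best_theta, best_gap = lo, float("inf")
    for i in range(grid_size):
        theta = lo + (hi - lo) * i / (grid_size - 1)
        # CDF of the predictions, corrected by the labeled-data discrepancy.
        rectified = F_unl(theta) + F_y(theta) - F_yhat(theta)
        gap = abs(rectified - q)
        if gap < best_gap:
            best_theta, best_gap = theta, gap
    return best_theta


# Synthetic data (an assumption for illustration): true median 2.0,
# predictions shifted upward by 0.5 on average.
rng = random.Random(3)
Y = [rng.gauss(2.0, 1.0) for _ in range(500)]
Yhat = [y + rng.gauss(0.5, 0.3) for y in Y]
Yhat_unlabeled = [rng.gauss(2.0, 1.0) + rng.gauss(0.5, 0.3) for _ in range(5000)]

median_estimate = ppi_quantile_sketch(Y, Yhat, Yhat_unlabeled, q=0.5)
print(median_estimate)
```

The rectifier term is what pulls the estimate back toward the true median even though the predictions themselves are shifted.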
- ppi_py.ppi_quantile_ci(Y, Yhat, Yhat_unlabeled, q, alpha=0.1, exact_grid=False, w=None, w_unlabeled=None)
Computes the prediction-powered confidence interval for the quantile.
- Parameters:
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
q (float) – Quantile to estimate. Must be in the range (0, 1).
alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in the range (0, 1).
exact_grid (bool, optional) – Whether to use the exact grid of values (True) or a linearly spaced grid of 5000 values (False).
w (ndarray, optional) – Sample weights for the labeled data set.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.
- Returns:
Lower and upper bounds of the prediction-powered confidence interval for the quantile.
- Return type:
tuple
- ppi_py.ppi_ols_pointestimate(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=None, coord=None, w=None, w_unlabeled=None)
Computes the prediction-powered point estimate of the OLS coefficients.
- Parameters:
X (ndarray) – Covariates corresponding to the gold-standard labels.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical point estimate.
coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.
w (ndarray, optional) – Sample weights for the labeled data set.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.
- Returns:
Prediction-powered point estimate of the OLS coefficients.
- Return type:
ndarray
Notes
[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
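For intuition, the sketch below handles the simplest case of one covariate plus an intercept. It assumes the estimate can be written as an OLS fit of lam * Yhat_unlabeled on X_unlabeled plus an OLS fit of the rectifier Y - lam * Yhat on X, a form that agrees with the lam=1 and lam=0 limits described above. The names simple_ols and ppi_ols_sketch are hypothetical, and the library's implementation (multiple covariates, weights, tuned lam) may differ.

```python
from statistics import fmean


def simple_ols(x, y):
    """Least-squares (intercept, slope) for a single covariate."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def ppi_ols_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=1.0):
    """Hypothetical helper: imputed OLS fit plus OLS fit of the rectifier."""
    b0_imp, b1_imp = simple_ols(X_unlabeled, [lam * f for f in Yhat_unlabeled])
    b0_rec, b1_rec = simple_ols(X, [y - lam * f for y, f in zip(Y, Yhat)])
    return b0_imp + b0_rec, b1_imp + b1_rec


# Toy data (an assumption for illustration): the truth is y = 1 + 2x,
# and every prediction is too large by exactly 1.
X = [0.0, 1.0, 2.0]
Y = [1.0, 3.0, 5.0]
Yhat = [2.0, 4.0, 6.0]
X_unlabeled = [0.0, 1.0, 2.0, 3.0]
Yhat_unlabeled = [2.0, 4.0, 6.0, 8.0]

intercept, slope = ppi_ols_sketch(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=1.0)
print(intercept, slope)
```

In this toy case the fit on the predictions alone would give intercept 2 and slope 2; the rectifier contributes intercept -1 and slope 0, so the combined estimate recovers intercept 1 and slope 2.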
- ppi_py.ppi_ols_ci(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, alpha=0.1, alternative='two-sided', lam=None, coord=None, w=None, w_unlabeled=None)
Computes the prediction-powered confidence interval for the OLS coefficients using the PPI++ algorithm from [ADZ23].
- Parameters:
X (ndarray) – Covariates corresponding to the gold-standard labels.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in the range (0, 1).
alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.
lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT interval.
coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.
w (ndarray, optional) – Sample weights for the labeled data set.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.
- Returns:
Lower and upper bounds of the prediction-powered confidence interval for the OLS coefficients.
- Return type:
tuple
Notes
[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
- ppi_py.ppi_logistic_pointestimate(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, lam=None, coord=None, optimizer_options=None, w=None, w_unlabeled=None)
Computes the prediction-powered point estimate of the logistic regression coefficients.
- Parameters:
X (ndarray) – Covariates corresponding to the gold-standard labels.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical point estimate.
coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.
optimizer_options (dict, optional) – Options to pass to the optimizer. See scipy.optimize.minimize for details.
w (ndarray, optional) – Sample weights for the labeled data set.
w_unlabeled (ndarray, optional) – Sample weights for the unlabeled data set.
- Returns:
Prediction-powered point estimate of the logistic regression coefficients.
- Return type:
ndarray
Notes
[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
- ppi_py.ppi_logistic_ci(X, Y, Yhat, X_unlabeled, Yhat_unlabeled, alpha=0.1, alternative='two-sided', lam=None, coord=None, optimizer_options=None, w=None, w_unlabeled=None)
Computes the prediction-powered confidence interval for the logistic regression coefficients using the PPI++ algorithm from [ADZ23].
- Parameters:
X (ndarray) – Covariates corresponding to the gold-standard labels.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
X_unlabeled (ndarray) – Covariates corresponding to the unlabeled data.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in the range (0, 1).
alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’.
lam (float, optional) – Power-tuning parameter (see [ADZ23]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPI with no power tuning, and setting lam=0 recovers the classical CLT interval.
coord (int, optional) – Coordinate for which to optimize lam. If None, it optimizes the total variance over all coordinates. Must be in {1, …, d} where d is the dimension of the estimand.
optimizer_options (dict, optional) – Options to pass to the optimizer. See scipy.optimize.minimize for details.
w (ndarray, optional) – Weights for the labeled data. If None, it is set to 1.
w_unlabeled (ndarray, optional) – Weights for the unlabeled data. If None, it is set to 1.
- Returns:
Lower and upper bounds of the prediction-powered confidence interval for the logistic regression coefficients.
- Return type:
tuple
Notes
[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
- ppi_py.ppboot(estimator, Y, Yhat, Yhat_unlabeled, X=None, X_unlabeled=None, lam=None, n_resamples=1000, n_resamples_lam=50, alpha=0.1, alternative='two-sided', method='percentile')
Computes the prediction-powered bootstrap confidence interval for the estimator.
- Parameters:
estimator (callable) – Estimator function. Takes in (X,Y) or (Y) and returns a point estimate.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
X (ndarray, optional) – Covariates corresponding to the gold-standard labels. Defaults to None. If None, the estimator is assumed to only take in Y.
X_unlabeled (ndarray, optional) – Covariates corresponding to the unlabeled data. Defaults to None. If None, the estimator is assumed to only take in Y. If X is not None, X_unlabeled must also be provided, and vice versa.
lam (float, optional) – Power-tuning parameter (see [ADZ23] in addition to [Z24]). The default value None will estimate the optimal value from data. Setting lam=1 recovers PPBoot with no power tuning, and setting lam=0 recovers the classical bootstrap interval.
n_resamples (int, optional) – Number of bootstrap resamples. Defaults to 1000.
n_resamples_lam (int, optional) – Number of bootstrap resamples for the power-tuning parameter. Defaults to 50.
alpha (float, optional) – Error level; the confidence interval will target a coverage of 1 - alpha. Must be in (0, 1). Defaults to 0.1.
alternative (str, optional) – Alternative hypothesis, either ‘two-sided’, ‘larger’ or ‘smaller’. Defaults to ‘two-sided’.
method (str, optional) – Method to compute the confidence interval, either ‘percentile’ or ‘basic’. Defaults to ‘percentile’.
- Returns:
Lower and upper bounds of the prediction-powered bootstrap confidence interval for the estimator.
- Return type:
tuple
Notes
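[ADZ23] A. N. Angelopoulos, J. C. Duchi, and T. Zrnic. PPI++: Efficient Prediction Powered Inference. arxiv:2311.01453, 2023.
[Z24] T. Zrnic. A Note on the Prediction-Powered Bootstrap. arxiv:2405.18379, 2024.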
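The resampling idea can be sketched in a few lines for an estimator that takes only Y. The sketch assumes each bootstrap draw is lam * estimator(resampled Yhat_unlabeled) + estimator(resampled Y) - lam * estimator(resampled Yhat), with labeled pairs resampled jointly, and it reports a two-sided percentile interval. The helper name ppboot_sketch, its seed argument, and the data are hypothetical; covariates, tuning of lam, one-sided alternatives, and the 'basic' method are not covered.

```python
import random
from math import ceil, floor
from statistics import fmean, median


def ppboot_sketch(estimator, Y, Yhat, Yhat_unlabeled, lam=1.0,
                  n_resamples=1000, alpha=0.1, seed=0):
    """Hypothetical helper: two-sided percentile interval from bootstrap draws."""
    rng = random.Random(seed)
    n, N = len(Y), len(Yhat_unlabeled)
    draws = []
    for _ in range(n_resamples):
        idx = rng.choices(range(n), k=n)  # resample labeled (Y, Yhat) pairs jointly
        Y_b = [Y[i] for i in idx]
        Yhat_b = [Yhat[i] for i in idx]
        Yhat_unl_b = rng.choices(Yhat_unlabeled, k=N)
        draws.append(lam * estimator(Yhat_unl_b) + estimator(Y_b) - lam * estimator(Yhat_b))
    draws.sort()
    lo = draws[max(0, floor(alpha / 2 * n_resamples))]
    hi = draws[min(n_resamples - 1, ceil((1 - alpha / 2) * n_resamples) - 1)]
    return lo, hi


# Synthetic data (an assumption for illustration); the target is the median.
rng = random.Random(4)
Y = [rng.gauss(0.0, 1.0) for _ in range(100)]
Yhat = [y + rng.gauss(0.0, 0.3) for y in Y]
Yhat_unlabeled = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, 0.3) for _ in range(1000)]

lower, upper = ppboot_sketch(median, Y, Yhat, Yhat_unlabeled, lam=1.0, n_resamples=500)
print(lower, upper)
```

Because the estimator is passed in as a callable, the same loop works for any statistic of Y, which is the main appeal of the bootstrap approach over estimator-specific CLT intervals.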
- ppi_py.ppi_distribution_label_shift_ci(Y, Yhat, Yhat_unlabeled, K, nu, alpha=0.1, delta=None, return_counts=True)
Computes the prediction-powered confidence interval for nu^T f for a discrete distribution f, under label shift.
- Parameters:
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
Yhat_unlabeled (ndarray) – Predictions corresponding to the unlabeled data.
K (int) – Number of classes.
nu (ndarray) – Vector nu. Coordinates must be bounded within [0, 1].
alpha (float, optional) – Final error level; the confidence interval will target a coverage of 1 - alpha. Must be in (0, 1).
delta (float, optional) – Error level of the intermediate confidence interval for the mean. Must be in (0, alpha). If return_counts == False, then delta is set equal to alpha and ignored.
return_counts (bool, optional) – Whether to return the number of samples in each class as opposed to the mean.
- Returns:
Lower and upper bounds of the prediction-powered confidence interval for nu^T f for a discrete distribution f, under label shift.
- Return type:
tuple