API Reference for Baselines
This page documents the functions that implement baseline inference strategies. These functions use either gold-standard data only, or gold-standard and unlabeled data in a way that is neither consistent with nor part of the PPI framework.
- ppi_py.classical_mean_ci(Y, w=None, alpha=0.1, alternative='two-sided')
Classical mean confidence interval using the central limit theorem.
- Parameters:
Y (ndarray) – Array of observations.
w (ndarray, optional) – Sample weights for the data set. Must be positive and will be normalized to sum to the size of the dataset.
alpha (float, optional) – Error level. Confidence interval will target a coverage of 1 - alpha. Defaults to 0.1. Must be in (0, 1).
alternative (str, optional) – One of “two-sided”, “larger”, or “smaller”. Defaults to “two-sided”.
- Returns:
(lower, upper) confidence interval bounds.
- Return type:
tuple
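The central-limit-theorem interval described above can be sketched with the standard library alone. This is a minimal illustration, not the ppi_py implementation: the helper name `clt_mean_ci` is hypothetical, it handles only the two-sided case without weights, and it assumes the sample standard deviation (n - 1 denominator) for the standard error.

```python
import math
from statistics import NormalDist, mean, stdev

def clt_mean_ci(y, alpha=0.1):
    """Two-sided CLT interval for the mean (hypothetical helper, not ppi_py code)."""
    n = len(y)
    center = mean(y)
    # Standard error of the mean; this sketch assumes the sample standard deviation (n - 1).
    se = stdev(y) / math.sqrt(n)
    # Normal critical value for a two-sided interval targeting coverage 1 - alpha.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return center - z * se, center + z * se

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
lower, upper = clt_mean_ci(data, alpha=0.1)
```

The interval is symmetric around the sample mean, and a smaller alpha gives a wider interval.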
- ppi_py.semisupervised_mean_ci(X, Y, X_unlabeled, K, alpha=0.1, alternative='two-sided', add_intercept=True)
Semisupervised mean confidence interval from [ZB22].
- Parameters:
X (ndarray) – Labeled covariates.
Y (ndarray) – Labeled responses.
X_unlabeled (ndarray) – Unlabeled covariates.
K (int) – Number of folds for cross-fitting.
alpha (float, optional) – Error level. Confidence interval will target a coverage of 1 - alpha. Defaults to 0.1. Must be in (0, 1).
alternative (str, optional) – One of “two-sided”, “larger”, or “smaller”. Defaults to “two-sided”.
add_intercept (bool, optional) – Whether to add an intercept to the covariates. Defaults to True.
- Returns:
(lower, upper) confidence interval bounds.
- Return type:
tuple
Notes
[ZB22] Y. Zhang and J. Bradic, High-dimensional semi-supervised learning: in search of optimal inference of the mean. arXiv:1902.00772, 2022.
- ppi_py.conformal_mean_ci(Y, Yhat, Yhat_unlabeled, alpha=0.1, bonferroni=True)
Confidence interval for the mean using conformal inference.
With bonferroni=True, this method has distribution-free coverage guarantees, but it tends to be extremely conservative. It constructs a conformal interval for each unlabeled sample and averages the endpoints. For the averaged interval to be valid, each individual conformal interval is made at level 1 - alpha / N, where N is the number of unlabeled samples (a Bonferroni correction, required for simultaneous inference). Setting bonferroni=False makes the intervals less conservative but voids the coverage guarantee. In practice, this method is not recommended.
- Parameters:
Y (ndarray) – Labeled responses.
Yhat (ndarray) – Predicted responses for labeled samples.
Yhat_unlabeled (ndarray) – Predicted responses for unlabeled samples.
alpha (float, optional) – Error level. Confidence interval will target a coverage of 1 - alpha. Defaults to 0.1. Must be in (0, 1).
bonferroni (bool, optional) – Whether to use a Bonferroni correction for simultaneous inference. Defaults to True.
- Returns:
(lower, upper) confidence interval bounds.
- Return type:
tuple
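The averaging-of-conformal-intervals idea described above can be sketched as follows. This is an illustration under stated assumptions, not the ppi_py implementation: the helper name `conformal_mean_ci_sketch` is hypothetical, and the conformity score is assumed to be the absolute residual on the labeled data, which the description does not specify.

```python
import math

def conformal_mean_ci_sketch(y, yhat, yhat_unlabeled, alpha=0.1, bonferroni=True):
    """Average per-sample split-conformal intervals (hypothetical helper, not ppi_py code)."""
    n, N = len(y), len(yhat_unlabeled)
    # Bonferroni: each individual interval is built at level 1 - alpha / N.
    per_sample_alpha = alpha / N if bonferroni else alpha
    # Assumed conformity score: absolute residual on the labeled data.
    scores = sorted(abs(yi - fi) for yi, fi in zip(y, yhat))
    # Rank of the conformal quantile among the n scores; past n, the interval is unbounded.
    k = math.ceil((n + 1) * (1 - per_sample_alpha))
    radius = scores[k - 1] if k <= n else math.inf
    # Every interval is yhat_i +/- radius, so averaging endpoints gives mean(yhat) +/- radius.
    center = sum(yhat_unlabeled) / N
    return center - radius, center + radius

y = [float(i) for i in range(20)]
yhat = [yi + 0.1 * (i + 1) for i, yi in enumerate(y)]
yhat_unlabeled = [1.0] * 10

with_bonferroni = conformal_mean_ci_sketch(y, yhat, yhat_unlabeled, alpha=0.1)
without_bonferroni = conformal_mean_ci_sketch(
    y, yhat, yhat_unlabeled, alpha=0.1, bonferroni=False
)
```

In this toy setup, with 20 labeled and 10 unlabeled samples, the Bonferroni-corrected interval is the whole real line, because a per-sample level of 1 - 0.1 / 10 needs more labeled samples than are available. The uncorrected interval is finite but carries no coverage guarantee for the mean.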
- ppi_py.classical_quantile_ci(Y, q, alpha=0.1)
Confidence interval for a quantile using the classical method.
- Parameters:
Y (ndarray) – Labeled responses.
q (float) – Quantile to estimate. Must be in (0, 1).
alpha (float, optional) – Error level. Confidence interval will target a coverage of 1 - alpha. Defaults to 0.1. Must be in (0, 1).
- Returns:
(lower, upper) confidence interval bounds.
- Return type:
tuple
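The entry above does not spell out what the classical method is. One common construction, shown here purely as an assumption, picks two order statistics whose ranks bracket n * q by a normal approximation to the binomial count of observations below the quantile. The helper name `quantile_ci_sketch` is hypothetical, and ppi_py may compute its interval differently.

```python
import math
from statistics import NormalDist

def quantile_ci_sketch(y, q, alpha=0.1):
    """Order-statistic interval for the q-quantile (hypothetical helper, not ppi_py code)."""
    n = len(y)
    # The count of observations below the true quantile is Binomial(n, q);
    # approximate it by a normal with mean n * q and variance n * q * (1 - q).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(n * q * (1 - q))
    # Clamp the (0-based) ranks so they stay inside the sorted sample.
    lo_idx = max(0, math.floor(n * q - half_width))
    hi_idx = min(n - 1, math.ceil(n * q + half_width))
    ordered = sorted(y)
    return ordered[lo_idx], ordered[hi_idx]

median_ci = quantile_ci_sketch(list(range(100)), q=0.5, alpha=0.1)
```

The returned bounds are always observed values, so the interval cannot extend beyond the range of the sample.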
- ppi_py.classical_ols_ci(X, Y, w=None, alpha=0.1, alternative='two-sided')
Confidence interval for the OLS coefficients using the classical method.
- Parameters:
X (ndarray) – Labeled features.
Y (ndarray) – Labeled responses.
w (ndarray, optional) – Sample weights for the data set. Must be positive and will be normalized to sum to the size of the dataset.
alpha (float, optional) – Error level. Confidence interval will target a coverage of 1 - alpha. Defaults to 0.1. Must be in (0, 1).
alternative (str, optional) – One of “two-sided”, “less”, or “greater”. Defaults to “two-sided”.
- Returns:
(lower, upper) confidence interval bounds.
- Return type:
tuple
- ppi_py.postprediction_ols_ci(Y, Yhat, X_unlabeled, Yhat_unlabeled, bootstrap_samples=50, alpha=0.1, alternative='two-sided')
Confidence interval for the OLS coefficients using the PostPI method from [WML20].
This method predates Prediction-Powered Inference and has no coverage guarantee unless the model is perfect. It is included for comparison purposes.
- Parameters:
Y (ndarray) – Labeled responses.
Yhat (ndarray) – Predicted responses for labeled samples.
X_unlabeled (ndarray) – Unlabeled features.
Yhat_unlabeled (ndarray) – Predicted responses for unlabeled samples.
bootstrap_samples (int, optional) – Number of bootstrap samples to use. Defaults to 50.
alpha (float, optional) – Error level. Confidence interval will target a coverage of 1 - alpha. Defaults to 0.1. Must be in (0, 1).
alternative (str, optional) – One of “two-sided”, “less”, or “greater”. Defaults to “two-sided”.
- Returns:
(lower, upper) confidence interval bounds.
- Return type:
tuple
Notes
[WML20] S. Wang, T. H. McCormick, and J. T. Leek, Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48): 30266-30275, 2020.
- ppi_py.logistic(X, Y)
Compute the logistic regression coefficients.
- Parameters:
X (ndarray) – Labeled features.
Y (ndarray) – Labeled responses.
- Returns:
Logistic regression coefficients.
- Return type:
ndarray
- ppi_py.classical_logistic_ci(X, Y, alpha=0.1, alternative='two-sided')
Confidence interval for the logistic regression coefficients using the classical method.
- Parameters:
X (ndarray) – Labeled features.
Y (ndarray) – Labeled responses.
alpha (float, optional) – Error level. Confidence interval will target a coverage of 1 - alpha. Defaults to 0.1. Must be in (0, 1).
alternative (str, optional) – One of “two-sided”, “less”, or “greater”. Defaults to “two-sided”.
- Returns:
(lower, upper) confidence interval bounds.
- Return type:
tuple