API Reference for Power Analysis
Documentation for functions implementing power analyses for PPI can be found here.
- ppi_py.ppi_power(ppi_corr, cost_X, cost_Y, cost_Yhat, budget=None, effective_n=None, n_max=None)
Computes the optimal pair of sample sizes for PPI when the PPI correlation is known.
- Parameters:
ppi_corr (float) – PPI correlation as defined in [BHvL24].
cost_X (float) – Cost per unlabeled data point.
cost_Y (float) – Cost per gold-standard label.
cost_Yhat (float) – Cost per prediction.
budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.
effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.
n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy n + N <= n_max.
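- Returns:
- Dictionary containing the following items:
n (int): Optimal number of gold-labeled samples.
N (int): Optimal number of unlabeled samples.
cost (float): Total cost.
effective_n (int): Effective sample size as defined in [BHvL24].
ppi_corr (float): PPI correlation as defined in [BHvL24].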
- Return type:
dict
Notes
At least one of budget and effective_n must be provided. If both are provided, budget will be used and the most powerful pair will be returned.
[BHvL24] Broska, D., Howes, M., & van Loon, A. (2024, August 22). The Mixed Subjects Design: Treating Large Language Models as (Potentially) Informative Observations. https://doi.org/10.31235/osf.io/j3bnt
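To make the quantities above concrete, the following is a minimal sketch of how the cost and effective sample size of a candidate pair (n, N) could be evaluated when the PPI correlation is known. Both helpers are hypothetical and are not part of ppi_py. The cost model (each labeled point incurs cost_X + cost_Y + cost_Yhat, each unlabeled point incurs cost_X + cost_Yhat) and the formula n(n + N) / (n + N(1 - ppi_corr^2)) are assumptions based on the standard variance argument for tuned PPI; check both against [BHvL24] before relying on them.

```python
def pair_cost(n, N, cost_X, cost_Y, cost_Yhat):
    """Total cost of n labeled and N unlabeled samples (assumed cost model)."""
    labeled = n * (cost_X + cost_Y + cost_Yhat)  # labeled points need X, Y and Yhat
    unlabeled = N * (cost_X + cost_Yhat)         # unlabeled points need X and Yhat
    return labeled + unlabeled


def effective_sample_size(n, N, ppi_corr):
    """Assumed effective sample size of a pair (n, N) for a known PPI correlation."""
    if n < 1 or N < 0:
        raise ValueError("n must be at least 1 and N must be non-negative")
    if not -1.0 <= ppi_corr <= 1.0:
        raise ValueError("ppi_corr must lie in [-1, 1]")
    return n * (n + N) / (n + N * (1.0 - ppi_corr**2))


# Illustrative inputs only; these are not outputs of ppi_py.
print(pair_cost(100, 1000, cost_X=0.0, cost_Y=1.0, cost_Yhat=0.1))
print(effective_sample_size(100, 900, ppi_corr=0.5))
```

Under these assumptions, a correlation of 0 leaves the effective sample size at n, while a correlation of 1 raises it to n + N, which is why cheap but informative predictions can substitute for gold-standard labels.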
- ppi_py.ppi_mean_power(Y, Yhat, cost_Y, cost_Yhat, budget=None, effective_n=None, n_max=None, w=None)
Computes the optimal pair of sample sizes for estimating the mean with PPI.
- Parameters:
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
cost_Y (float) – Cost per gold-standard label.
cost_Yhat (float) – Cost per prediction.
budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.
effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.
n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.
w (ndarray, optional) – Sample weights for the labeled data set. Defaults to a vector of all ones.
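- Returns:
- Dictionary containing the following items:
n (int): Optimal number of gold-labeled samples.
N (int): Optimal number of unlabeled samples.
cost (float): Total cost.
effective_n (int): Effective sample size as defined in [BHvL24].
ppi_corr (float): PPI correlation as defined in [BHvL24].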
- Return type:
dict
Notes
At least one of budget and effective_n must be provided. If both are provided, budget will be used and the most powerful pair will be returned.
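The precedence rule in the note above can be sketched as a small helper. This is an illustration of the documented behavior only: `resolve_mode` is a hypothetical name, and raising `ValueError` when neither argument is given is an assumption, since the exception type used by ppi_py is not stated here.

```python
def resolve_mode(budget=None, effective_n=None):
    """Report which optimization the arguments select (hypothetical helper)."""
    if budget is None and effective_n is None:
        # Assumed error type; the documentation only says one must be provided.
        raise ValueError("At least one of budget and effective_n must be provided.")
    if budget is not None:
        # budget takes precedence, even when effective_n is also given
        return "most_powerful"
    return "cheapest"


print(resolve_mode(budget=500.0))
print(resolve_mode(effective_n=200))
print(resolve_mode(budget=500.0, effective_n=200))
```

Passing only the argument that matches the intended question (most powerful pair for a budget, or cheapest pair for a target effective sample size) avoids relying on the precedence rule at all.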
- ppi_py.ppi_ols_power(X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord, budget=None, effective_n=None, n_max=None, w=None)
Computes the optimal pair of sample sizes for estimating OLS coefficients with PPI.
- Parameters:
X (ndarray) – Covariates corresponding to the gold-standard labels.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
cost_X (float) – Cost per unlabeled data point.
cost_Y (float) – Cost per gold-standard label.
cost_Yhat (float) – Cost per prediction.
coord (int) – Coordinate to perform power analysis on. Must be in {0, …, d-1} where d is the dimension of the estimand.
budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.
effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.
n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.
w (ndarray, optional) – Sample weights for the labeled data set.
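- Returns:
- Dictionary containing the following items:
n (int): Optimal number of gold-labeled samples.
N (int): Optimal number of unlabeled samples.
cost (float): Total cost.
effective_n (int): Effective sample size as defined in [BHvL24].
ppi_corr (float): PPI correlation as defined in [BHvL24].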
- Return type:
dict
Notes
At least one of budget and effective_n must be provided. If both are provided, budget will be used.
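Because coord must index a coefficient of the estimand, it can help to validate it before running a power analysis. The sketch below is a hypothetical helper, not part of ppi_py. It assumes that X is a non-empty sequence of rows and that d equals the number of columns of X, which holds for a regression with one coefficient per covariate.

```python
def validate_coord(coord, X):
    """Check that coord lies in {0, ..., d-1}, with d taken as the column count of X."""
    d = len(X[0])  # assumed: one coefficient per covariate column
    if not isinstance(coord, int) or not 0 <= coord < d:
        raise ValueError(f"coord must be an integer in [0, {d - 1}], got {coord!r}")
    return coord


# Illustrative design matrix: an intercept column and two covariates.
X = [
    [1.0, 0.2, 3.1],
    [1.0, 1.5, 2.7],
    [1.0, 0.9, 4.0],
]
print(validate_coord(2, X))
```

Note that if X includes an intercept column, as in this illustration, coord 0 refers to the intercept rather than the first covariate.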
- ppi_py.ppi_logistic_power(X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord, budget=None, effective_n=None, n_max=None, w=None)
Computes the optimal pair of sample sizes for estimating logistic regression coefficients with PPI.
- Parameters:
X (ndarray) – Covariates corresponding to the gold-standard labels.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
cost_X (float) – Cost per unlabeled data point.
cost_Y (float) – Cost per gold-standard label.
cost_Yhat (float) – Cost per prediction.
coord (int) – Coordinate to perform power analysis on. Must be in {0, …, d-1} where d is the dimension of the estimand.
budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.
effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.
n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.
w (ndarray, optional) – Sample weights for the labeled data set.
- Returns:
- Dictionary containing the following items:
n (int): Optimal number of gold-labeled samples.
N (int): Optimal number of unlabeled samples.
cost (float): Total cost.
effective_n (int): Effective sample size as defined in [BHvL24].
ppi_corr (float): PPI correlation as defined in [BHvL24].
- Return type:
dict
Notes
At least one of budget and effective_n must be provided. If both are provided, budget will be used.
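Once a result dictionary with the keys listed above is available, the constraints can be sanity-checked on the caller's side. The sketch below is a hypothetical helper, not part of ppi_py, and the example dictionary contains made-up illustrative values rather than real output. Whether the returned cost can exceed the budget slightly because of integer rounding is not stated in this reference, so treat the budget check as a diagnostic, not a guarantee.

```python
def check_pair(result, budget=None, n_max=None):
    """Return True if a result dict respects the given budget and n_max."""
    ok = True
    if budget is not None:
        ok = ok and result["cost"] <= budget
    if n_max is not None:
        # n_max bounds the total number of samples, labeled plus unlabeled
        ok = ok and result["n"] + result["N"] <= n_max
    return ok


# Made-up values with the documented keys; not produced by ppi_py.
example = {"n": 50, "N": 400, "cost": 90.0, "effective_n": 120, "ppi_corr": 0.8}
print(check_pair(example, budget=100.0, n_max=500))
```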
- ppi_py.ppi_poisson_power(X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord, budget=None, effective_n=None, n_max=None, w=None)
Computes the optimal pair of sample sizes for estimating Poisson regression coefficients with PPI.
- Parameters:
X (ndarray) – Covariates corresponding to the gold-standard labels.
Y (ndarray) – Gold-standard labels.
Yhat (ndarray) – Predictions corresponding to the gold-standard labels.
cost_X (float) – Cost per unlabeled data point.
cost_Y (float) – Cost per gold-standard label.
cost_Yhat (float) – Cost per prediction.
coord (int) – Coordinate to perform power analysis on. Must be in {0, …, d-1} where d is the dimension of the estimand.
budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.
effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.
n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.
w (ndarray, optional) – Sample weights for the labeled data set.
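- Returns:
- Dictionary containing the following items:
n (int): Optimal number of gold-labeled samples.
N (int): Optimal number of unlabeled samples.
cost (float): Total cost.
effective_n (int): Effective sample size as defined in [BHvL24].
ppi_corr (float): PPI correlation as defined in [BHvL24].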
- Return type:
dict
Notes
At least one of budget and effective_n must be provided. If both are provided, budget will be used.