API Reference for Power Analysis

This page documents the functions that implement power analyses for prediction-powered inference (PPI).

ppi_py.ppi_power(ppi_corr, cost_X, cost_Y, cost_Yhat, budget=None, effective_n=None, n_max=None)

Computes the optimal pair of sample sizes for PPI when the PPI correlation is known.

Parameters:
  • ppi_corr (float) – PPI correlation as defined in [BHvL24].

  • cost_X (float) – Cost per unlabeled data point.

  • cost_Y (float) – Cost per gold-standard label.

  • cost_Yhat (float) – Cost per prediction.

  • budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.

  • effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.

  • n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy n + N <= n_max.

Returns:

Dictionary containing the following items:
  • n (int): Optimal number of gold-labeled samples.

  • N (int): Optimal number of unlabeled samples.

  • cost (float): Total cost.

  • effective_n (int): Effective sample size as defined in [BHvL24].

  • ppi_corr (float): PPI correlation as defined in [BHvL24].

Return type:

dict

Notes

At least one of budget and effective_n must be provided. If both are provided, budget will be used and the most powerful pair will be returned.
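The precedence rule in the note above can be sketched as follows. The helper `choose_mode` is hypothetical and not part of ppi_py, and the exception it raises when neither argument is given is an assumption, since this reference does not say how the library reports that case. `plan_with_known_correlation` only forwards to `ppi_power` with the documented signature and imports it lazily, so ppi_py must be installed when it is called.

```python
def choose_mode(budget=None, effective_n=None):
    """Return which pair the documented rule selects (hypothetical helper)."""
    if budget is None and effective_n is None:
        # Assumption: the exact error ppi_py raises is not documented here.
        raise ValueError("Provide at least one of budget and effective_n.")
    # Per the Notes, budget takes precedence when both are supplied.
    return "most powerful pair" if budget is not None else "cheapest pair"


def plan_with_known_correlation(ppi_corr, cost_X, cost_Y, cost_Yhat,
                                budget=None, effective_n=None, n_max=None):
    """Forward to ppi_power using the signature documented above."""
    from ppi_py import ppi_power  # lazy import: needs ppi_py when called

    choose_mode(budget, effective_n)  # fail early if both are missing
    return ppi_power(ppi_corr, cost_X, cost_Y, cost_Yhat,
                     budget=budget, effective_n=effective_n, n_max=n_max)
```

Passing both `budget` and `effective_n` therefore behaves like passing `budget` alone.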

[BHvL24] Broska, D., Howes, M., & van Loon, A. (2024, August 22). The Mixed Subjects Design: Treating Large Language Models as (Potentially) Informative Observations. https://doi.org/10.31235/osf.io/j3bnt

ppi_py.ppi_mean_power(Y, Yhat, cost_Y, cost_Yhat, budget=None, effective_n=None, n_max=None, w=None)

Computes the optimal pair of sample sizes for estimating the mean with PPI.

Parameters:
  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • cost_Y (float) – Cost per gold-standard label.

  • cost_Yhat (float) – Cost per prediction.

  • budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.

  • effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.

  • n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.

  • w (ndarray, optional) – Sample weights for the labeled data set. Defaults to a vector of ones.

Returns:

Dictionary containing the following items:
  • n (int): Optimal number of gold-labeled samples.

  • N (int): Optimal number of unlabeled samples.

  • cost (float): Total cost.

  • effective_n (int): Effective sample size as defined in [BHvL24].

  • ppi_corr (float): PPI correlation as defined in [BHvL24].

Return type:

dict

Notes

At least one of budget and effective_n must be provided. If both are provided, budget will be used and the most powerful pair will be returned.
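A caller may want to confirm that a returned plan carries the documented fields and respects `n_max` before acting on it. The following is a minimal sketch; `check_plan` and `EXPECTED_KEYS` are hypothetical names, not part of ppi_py, and the cost and budget values in the usage comment are illustrative only.

```python
# Keys listed under "Returns" above.
EXPECTED_KEYS = {"n", "N", "cost", "effective_n", "ppi_corr"}


def check_plan(plan, n_max=None):
    """Validate a result dictionary against the documented fields."""
    missing = EXPECTED_KEYS - set(plan)
    if missing:
        raise KeyError(f"missing documented keys: {sorted(missing)}")
    # The documented constraint applies only when n_max was supplied.
    if n_max is not None and plan["n"] + plan["N"] > n_max:
        raise ValueError("plan violates n + N <= n_max")
    return plan


# Usage with ppi_py installed (Y and Yhat are pilot arrays of equal length):
#   from ppi_py import ppi_mean_power
#   plan = check_plan(
#       ppi_mean_power(Y, Yhat, cost_Y=1.0, cost_Yhat=0.01,
#                      budget=500.0, n_max=10_000),
#       n_max=10_000,
#   )
```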


ppi_py.ppi_ols_power(X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord, budget=None, effective_n=None, n_max=None, w=None)

Computes the optimal pair of sample sizes for estimating OLS coefficients with PPI.

Parameters:
  • X (ndarray) – Covariates corresponding to the gold-standard labels.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • cost_X (float) – Cost per unlabeled data point.

  • cost_Y (float) – Cost per gold-standard label.

  • cost_Yhat (float) – Cost per prediction.

  • coord (int) – Coordinate to perform power analysis on. Must be in {0, …, d-1}, where d is the dimension of the estimand.

  • budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.

  • effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.

  • n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.

  • w (ndarray, optional) – Sample weights for the labeled data set.

Returns:

Dictionary containing the following items:
  • n (int): Optimal number of gold-labeled samples.

  • N (int): Optimal number of unlabeled samples.

  • cost (float): Total cost.

  • effective_n (int): Effective sample size as defined in [BHvL24].

  • ppi_corr (float): PPI correlation as defined in [BHvL24].

Return type:

dict

Notes

At least one of budget and effective_n must be provided. If both are provided, budget will be used and the most powerful pair will be returned.
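The constraint on `coord` can be checked before the power analysis is run. The sketch below assumes that d equals the number of columns of X (one coefficient per covariate column), which this reference does not state explicitly; `validate_coord` is a hypothetical helper, not part of ppi_py.

```python
import operator


def validate_coord(coord, X):
    """Check that coord indexes a column of the 2-D covariate array X."""
    d = len(X[0])  # assumption: d is the number of covariate columns
    i = operator.index(coord)  # raises TypeError for non-integers such as 1.0
    if not 0 <= i < d:
        raise ValueError(f"coord must be in {{0, ..., {d - 1}}}, got {i}")
    return i
```

The helper accepts nested lists as well as NumPy arrays, since it relies only on `len` and indexing.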


ppi_py.ppi_logistic_power(X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord, budget=None, effective_n=None, n_max=None, w=None)

Computes the optimal pair of sample sizes for estimating logistic regression coefficients with PPI.

Parameters:
  • X (ndarray) – Covariates corresponding to the gold-standard labels.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • cost_X (float) – Cost per unlabeled data point.

  • cost_Y (float) – Cost per gold-standard label.

  • cost_Yhat (float) – Cost per prediction.

  • coord (int) – Coordinate to perform power analysis on. Must be in {0, …, d-1}, where d is the dimension of the estimand.

  • budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.

  • effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.

  • n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.

  • w (ndarray, optional) – Sample weights for the labeled data set.

Returns:

Dictionary containing the following items:
  • n (int): Optimal number of gold-labeled samples.

  • N (int): Optimal number of unlabeled samples.

  • cost (float): Total cost.

  • effective_n (int): Effective sample size as defined in [BHvL24].

  • ppi_corr (float): PPI correlation as defined in [BHvL24].

Return type:

dict

Notes

At least one of budget and effective_n must be provided. If both are provided, budget will be used and the most powerful pair will be returned.
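To see how the recommended sample sizes change with the budget, the function can be called over a grid of budgets. The following is a minimal sketch; `sweep_budgets` is a hypothetical helper that accepts any callable taking a `budget` keyword and returning the dictionary described above.

```python
def sweep_budgets(power_fn, budgets, **fixed_kwargs):
    """Call power_fn once per budget and collect the documented fields."""
    rows = []
    for budget in budgets:
        plan = power_fn(budget=budget, **fixed_kwargs)
        rows.append((budget, plan["n"], plan["N"], plan["effective_n"]))
    return rows


# Usage with ppi_py installed (X, Y, Yhat are pilot data; costs illustrative):
#   from functools import partial
#   from ppi_py import ppi_logistic_power
#   fn = partial(ppi_logistic_power, X, Y, Yhat, 0.0, 1.0, 0.01, 0)
#   rows = sweep_budgets(fn, [100.0, 200.0, 400.0])
```

Binding the data, costs, and `coord` with `functools.partial` keeps the sweep independent of which power function is used.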


ppi_py.ppi_poisson_power(X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord, budget=None, effective_n=None, n_max=None, w=None)

Computes the optimal pair of sample sizes for estimating Poisson regression coefficients with PPI.

Parameters:
  • X (ndarray) – Covariates corresponding to the gold-standard labels.

  • Y (ndarray) – Gold-standard labels.

  • Yhat (ndarray) – Predictions corresponding to the gold-standard labels.

  • cost_X (float) – Cost per unlabeled data point.

  • cost_Y (float) – Cost per gold-standard label.

  • cost_Yhat (float) – Cost per prediction.

  • coord (int) – Coordinate to perform power analysis on. Must be in {0, …, d-1}, where d is the dimension of the estimand.

  • budget (float, optional) – Total budget. Used to compute the most powerful pair given the budget.

  • effective_n (int, optional) – Effective sample size. Used to compute the cheapest pair.

  • n_max (int, optional) – Maximum number of samples allowed. If provided, the optimal pair will satisfy the additional constraint that n + N <= n_max.

  • w (ndarray, optional) – Sample weights for the labeled data set.

Returns:

Dictionary containing the following items:
  • n (int): Optimal number of gold-labeled samples.

  • N (int): Optimal number of unlabeled samples.

  • cost (float): Total cost.

  • effective_n (int): Effective sample size as defined in [BHvL24].

  • ppi_corr (float): PPI correlation as defined in [BHvL24].

Return type:

dict

Notes

At least one of budget and effective_n must be provided. If both are provided, budget will be used and the most powerful pair will be returned.
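Because ppi_ols_power, ppi_logistic_power, and ppi_poisson_power share one signature in this reference, a caller can select the regression model by name. The sketch below is one way to do that; the `POWER_FUNCTIONS` table and the `regression_power` wrapper are hypothetical conveniences, not part of ppi_py.

```python
import importlib

# Map a short model name to the documented function name.
POWER_FUNCTIONS = {
    "ols": "ppi_ols_power",
    "logistic": "ppi_logistic_power",
    "poisson": "ppi_poisson_power",
}


def regression_power(model, X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord,
                     **kwargs):
    """Look up the power function for `model` and call it."""
    if model not in POWER_FUNCTIONS:
        raise ValueError(
            f"unknown model {model!r}; choose from {sorted(POWER_FUNCTIONS)}"
        )
    ppi_py = importlib.import_module("ppi_py")  # lazy: needs ppi_py installed
    fn = getattr(ppi_py, POWER_FUNCTIONS[model])
    # Optional arguments (budget, effective_n, n_max, w) pass through kwargs.
    return fn(X, Y, Yhat, cost_X, cost_Y, cost_Yhat, coord, **kwargs)
```

The model name is validated before the import, so a typo is reported even when ppi_py is not available.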
