scR

scR estimates empirical sample complexity bounds for supervised learning tasks. The core workflow is:

  1. estimate resampled generalization curves with estimate_accuracy();
  2. fit/extrapolate those curves with interpolate_scb(); and
  3. summarize or plot the estimated sample complexity bound.

Basic use

library(scR)

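# Model wrapper: a logistic regression whose fitted object carries the
# "svrclass" class tag in addition to "glm".
mylogit <- function(formula, data) {
  structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")
  )
}

# Prediction wrapper: threshold fitted probabilities at 0.5. The factor levels
# are fixed so both classes exist even if a small resample predicts only one.
mypred <- function(m, newdata) {
  p <- predict.glm(m, newdata, type = "response")
  factor(ifelse(p > 0.5, 1, 0), levels = c("0", "1"))
}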

# In applied work, pass your observed data instead of generating synthetic data.
dat <- gendata(mylogit, dim = 3, maxn = 250, predictfn = mypred)

results <- estimate_accuracy(
  y ~ .,
  mylogit,
  data = dat,
  predictfn = mypred,
  nsample = 10,
  steps = 25,
  parallel = FALSE,
  backend = "sequential"
)

scbhat <- interpolate_scb(
  list(results),
  epsilon = 0.05,
  delta = 0.05,
  maxN = nrow(dat)
)

summary(scbhat)
plot(scbhat, list(results), plot_type = "Delta")
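
To make the epsilon and delta arguments concrete, the following base R sketch shows one common reading of a sample complexity bound: the smallest training size at which the share of resamples with error above epsilon falls to delta or below. The error rates, the `errs` list, and the looser delta of 0.25 are hypothetical values chosen for illustration. This is not scR's internal computation and it does not reproduce the interpolation done by interpolate_scb().

```r
# Hypothetical error rates from 4 resamples at each training size n.
errs <- list(
  "50"  = c(0.12, 0.09, 0.04, 0.08),
  "100" = c(0.07, 0.04, 0.03, 0.06),
  "150" = c(0.04, 0.03, 0.06, 0.02),
  "200" = c(0.03, 0.02, 0.04, 0.01)
)

epsilon <- 0.05  # tolerated error
delta   <- 0.25  # tolerated share of resamples exceeding epsilon (illustrative)

# Empirical failure rate at each n: fraction of resamples with error > epsilon.
delta_hat <- vapply(errs, function(e) mean(e > epsilon), numeric(1))

# Smallest n on the grid whose failure rate is within the tolerance.
n_grid  <- as.integer(names(errs))
scb_toy <- min(n_grid[delta_hat <= delta])

print(delta_hat)
print(scb_toy)
```

With real output, the curve is noisy and the target may lie beyond the observed sample sizes, which is why the workflow fits and extrapolates the curve rather than reading it off a grid.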

Optional monotone Gaussian process extrapolation

The package also includes the monotone-integrated Gaussian process extrapolator used in the paper appendix. This is an optional nonparametric robustness check. It requires a working CmdStan installation plus the cmdstanr and posterior packages. These are not hard dependencies of scR, so the core package can be installed and checked without a Stan toolchain.
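
Because the Stan toolchain is optional, a script may want to check for it before calling the GP extrapolator. The guard below is a minimal sketch of one way to do that; `has_pkgs` and `gp_ready` are names chosen for this example and are not part of scR.

```r
# Check for the optional GP toolchain without attaching any package.
has_pkgs <- requireNamespace("cmdstanr", quietly = TRUE) &&
  requireNamespace("posterior", quietly = TRUE)

# With error_on_NA = FALSE, cmdstan_version() returns NULL when no CmdStan
# installation is found, instead of raising an error.
gp_ready <- has_pkgs &&
  !is.null(cmdstanr::cmdstan_version(error_on_NA = FALSE))

if (!gp_ready) {
  message("GP extrapolation unavailable; the core scR workflow does not need it.")
}
```

If cmdstanr is installed but CmdStan itself is missing, cmdstanr::install_cmdstan() is the usual way to obtain it.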

# Requires cmdstanr, posterior, and CmdStan.
gp_delta <- interpolate_scb_gp(
  results,
  epsilon = 0.05,
  delta = 0.05,
  maxN = nrow(dat),
  curve = "delta",
  M_grid = 80
)

summary(gp_delta)
plot(gp_delta, plot_type = "Delta")

The GP implementation uses the paper’s monotone-integrated construction: a Gaussian process is placed on an unconstrained latent field, a softplus transform produces a nonnegative derivative, the derivative is integrated on a fixed grid, and the resulting latent curve is mapped to either the delta or epsilon mean curve.
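
The construction can be sketched in base R, without Stan, as a single prior draw. This is an illustration under stated assumptions, not the package's Stan model: the squared-exponential kernel, its hyperparameters, the unit-interval grid, and the final `exp(-h)` link are all choices made for this example. Only the sequence of steps (latent field, softplus, integration on a fixed grid, mapping to a mean curve) follows the description above.

```r
set.seed(1)

# Fixed grid for the latent field; M plays the role of a grid size.
M <- 80
x <- seq(0, 1, length.out = M)

# Squared-exponential covariance plus jitter for numerical stability.
# Kernel and hyperparameters are assumptions made for this sketch.
K <- exp(-0.5 * outer(x, x, "-")^2 / 0.2^2) + diag(1e-6, M)

# One draw of the unconstrained latent field g ~ GP(0, K).
g <- as.vector(t(chol(K)) %*% rnorm(M))

# Numerically stable softplus: maps the field to a positive derivative.
softplus <- function(z) log1p(exp(-abs(z))) + pmax(z, 0)
deriv <- softplus(g)

# Trapezoidal integration on the grid gives an increasing latent curve.
h <- c(0, cumsum(diff(x) * (head(deriv, -1) + tail(deriv, -1)) / 2))

# Illustrative link only: a decreasing map of h into (0, 1].
mean_curve <- exp(-h)
```

Monotonicity holds by construction: the softplus output is positive everywhere, so its running integral can only increase, whatever shape the latent draw takes.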