Skip to contents

Introduction

HOIF is an R package for the implementation of Higher-Order Influence Function (HOIF) estimators for the Average Treatment Effect (ATE). The methodology is based on a series of foundational works by James M. Robins and his collaborators [1–4].

The overall algorithmic workflow, mathematical formulas, and all parameters of HOIF are illustrated in HOIF.pdf (on GitHub) (after installation: system.file("extdoc", "HOIF.pdf", package = "HOIF")) and in the package vignette — both can be used as references.


The Core Computation

A core computational component of HOIF is the evaluation of higher-order U-statistics.

We have developed a general algorithm for computing U-statistics using the powerful Python functions numpy.einsum and torch.einsum.

We have built a Python package, u-stats (source: U-Statistics-python), along with an R interface provided by ustats. Using ustats, we developed this R package for HOIF estimation of the ATE.

We also analyze computational complexity using graph-theoretic tools in a dedicated paper focused on the exact computation of higher-order U-statistics [5].

For HOIF estimators of the ATE, a key takeaway is:

  • when the order m7m \le 7, the computational complexity is O(n3+nk2+k3+n2k)O(n^3 + nk^2 + k^3 + n^2 k)
  • when the order m>7m > 7, the computational complexity exceeds O(n4+nk2+k3+n2k)O(n^4 + nk^2 + k^3 + n^2 k)

Here, nn is the sample size and kk is the user-defined dimension of the transformed covariates XX. For more details on computing the higher-order U-statistics arising in HOIF, see Section 4.1 of [5].


Installation

# From CRAN:
install.packages("HOIF")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("cxy0714/HOIF")

Setting up the Python backend

By default, the higher-order U-statistics are computed in Python through the ustats package, which needs Python (>= 3.11) with the Python packages u-stats, numpy, and torch. There are three ways to get them — pick the one that fits you:

With reticulate (>= 1.41), the Python dependencies are declared by ustats and provisioned automatically the first time Python is needed. Just call hoif_ate():

library(HOIF)
results <- hoif_ate(...)   # first call sets up Python automatically

The downloaded environment is cached and reused across sessions.

Note: the first call downloads PyTorch. On Linux the default build bundles CUDA libraries (~2.5 GB); if you prefer a small CPU-only build or want full control, use Option 2 or 3.

Option 2 — One-shot managed setup: ustats::setup_ustats()

Creates a persistent environment and installs all dependencies. By default it installs the CPU-only build of PyTorch (~200 MB instead of ~2.5 GB):

ustats::setup_ustats()              # CPU-only PyTorch (default)
ustats::setup_ustats(gpu = TRUE)    # default PyPI PyTorch (CUDA on Linux)

Option 3 — Use your own Python/conda environment

If you already have an environment with a configured PyTorch (e.g. a specific CUDA version), just add u-stats:

pip install u-stats

and point reticulate to that environment before Python initializes:

library(HOIF)
reticulate::use_condaenv("your_env_name", required = TRUE)  # or use_virtualenv()
# Alternatively, set the RETICULATE_PYTHON environment variable
# (e.g. in .Rprofile or .Renviron) to the path of your python binary.

Verify the setup

Whichever option you chose:

ustats::check_ustats_setup()
#> === ustats Environment Status ===
#> [OK] Python: /path/to/python
#> [OK] u_stats available
#> [OK] NumPy available
#> [OK] PyTorch available (version 2.x, CUDA available)

See vignette("ustats", package = "ustats") for a complete installation guide and troubleshooting tips.

No Python at all?

Pass pure_R_code = TRUE to hoif_ate(): the U-statistics are then computed by a pure R implementation (orders up to 6) and no Python runtime is required:

results <- hoif_ate(X, A, Y, mu1 = mu1, mu0 = mu0, pi = pi,
                    m = 6, pure_R_code = TRUE)

Example

Before the code, a note on sample splitting. Conceptually, HOIF estimation involves three estimation tasks, and ideally each uses its own, independent part of the data:

  1. estimating the nuisance functions μ̂(1,X),μ̂(0,X),π̂(X)\hat{\mu}(1, X), \hat{\mu}(0, X), \hat{\pi}(X);
  2. estimating the inverse weighted Gram matrix Ω̂a\hat{\Omega}^a;
  3. computing the higher-order U-statistics.

This package does not implement task 1: hoif_ate() only takes the predicted values (mu1, mu0, pi) as inputs, so the overall three-way cross-fitting is left to the user. The sample_split argument controls only the split between tasks 2 and 3:

  • sample_split = TRUE (eHOIF): cross-fits the Gram matrix against the U-statistics over n_folds folds and averages the results;
  • sample_split = FALSE (sHOIF): computes both on the same sample, without distinction.

The example below shows the recommended workflow. The data are first split in half: the nuisance functions are fitted on one half (task 1) and hoif_ate() is run only on the other half (tasks 2–3), so the nuisance predictions are independent of the estimation sample. Within the estimation half, eHOIF then additionally cross-fits the Gram matrix against the U-statistics, while sHOIF uses the estimation half as is.

A more comprehensive example can be found in test_manual/manual_test.R.

library(HOIF)

set.seed(123)

n <- 2000   # Total sample size
p <- 50     # Number of covariates

# Covariates
X <- matrix(rnorm(n * p), ncol = p)

# Binary treatment assignment (randomized)
A <- rbinom(n, 1, 0.5)

# True coefficient vector (normalized)
beta <- runif(p)
beta <- beta / sqrt(as.numeric(crossprod(beta)))

# Outcome: partially linear model
Y <- as.numeric(A + X %*% beta + rnorm(n, 0, 0.1))


### Step 1: Split the sample ---------------------------------------------
### One half (the "nuisance sample") is used only to fit the nuisance
### regressions; the other half (the "estimation sample") is used only by
### hoif_ate(). This keeps the nuisance predictions independent of the
### sample on which the HOIF estimator is computed.

idx_nuis <- sample(n, n / 2)
idx_est  <- setdiff(seq_len(n), idx_nuis)

X_nuis <- X[idx_nuis, , drop = FALSE]
A_nuis <- A[idx_nuis]
Y_nuis <- Y[idx_nuis]

X_est <- X[idx_est, , drop = FALSE]
A_est <- A[idx_est]
Y_est <- Y[idx_est]


### Step 2: Estimate nuisance outcome regressions on the nuisance sample
### and predict them on the estimation sample (here we omit the estimation
### of the propensity score by using the known randomization probability)

# - mu1(X) = E[Y | A = 1, X]
# - mu0(X) = E[Y | A = 0, X]

library(glmnet)

cv_fit1 <- cv.glmnet(X_nuis[A_nuis == 1, , drop = FALSE],
                     Y_nuis[A_nuis == 1], alpha = 1)
cv_fit0 <- cv.glmnet(X_nuis[A_nuis == 0, , drop = FALSE],
                     Y_nuis[A_nuis == 0], alpha = 1)

mu1 <- as.vector(predict(cv_fit1, newx = X_est, s = "lambda.min"))
mu0 <- as.vector(predict(cv_fit0, newx = X_est, s = "lambda.min"))

# Known propensity score (randomized treatment)
pi <- rep(0.5, length(idx_est))


### Step 3: eHOIF -- cross-fit the Gram matrix vs the U-statistics --------
### Within the estimation sample, hoif_ate() splits the data into n_folds
### folds: for each fold, the inverse weighted Gram matrix is estimated on
### the other folds, the U-statistics are computed on the held-out fold,
### and the results are averaged.

m <- 7
n_folds <- 2

results_split <- hoif_ate(
  X_est, A_est, Y_est,
  mu1 = mu1,
  mu0 = mu0,
  pi = pi,
  transform_method = "none",
  m = m,
  sample_split = TRUE,
  n_folds = n_folds,
  seed = 123,
  backend = "torch"
)

cat("\nResults (eHOIF, with sample splitting):\n")
print(results_split)
plot(results_split)


### Step 4: sHOIF -- Gram matrix and U-statistics on the same sample ------
### Still uses the independent nuisance predictions from Step 2; only the
### split between the Gram matrix and the U-statistics is dropped.

results_no_split <- hoif_ate(
  X_est, A_est, Y_est,
  mu1 = mu1,
  mu0 = mu0,
  pi = pi,
  transform_method = "none",
  m = m,
  sample_split = FALSE,
  backend = "torch"
)

cat("\nResults (sHOIF, without sample splitting):\n")
print(results_no_split)
plot(results_no_split)

References

1 Robins, J., Li, L., Tchetgen Tchetgen, E., & van der Vaart, A. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and Statistics: Essays in Honor of David A. Freedman, 335–421.

2 Robins, J. M., Li, L., Mukherjee, R., Tchetgen Tchetgen, E., & van der Vaart, A. (2017). Minimax estimation of a functional on a structured high-dimensional model. The Annals of Statistics, 45(5), 1951–1987.

3 Liu, L., Mukherjee, R., Newey, W. K., & Robins, J. M. (2017). Semiparametric efficient empirical higher order influence function estimators. arXiv:1705.07577.

4 Liu, L., & Li, C. (2023). New √n-consistent, numerically stable higher-order influence function estimators. arXiv:2302.08097.

5 Chen, X., Zhang, R., & Liu, L. (2025). On computing and the complexity of computing higher-order U-statistics, exactly. arXiv:2508.12627.