A sequential algorithm to create non-parametric knockoffs based on principal component regression and residual permutation.

create.pc(X, n_ko = 1, ncomp, verbose = FALSE)

Arguments

X

A numeric matrix representing the original design matrix.

n_ko

An integer specifying the number of knockoff matrices to generate. Default is 1.

ncomp

The number of principal components to use in the knockoff generation process.

verbose

Logical. Whether to display progress information during knockoff generation. Default is FALSE.

Value

A list of n_ko principal component knockoff matrices, one for each generated knockoff copy.

Details

For each original variable \(\mathbf{x}_j\), where \(j = 1, \ldots, p\), the following steps are performed to generate knockoff variables:

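  1. Conduct PCA on the matrix \(\left(\mathbf{X}_{-j}, \mathbf{Z}_{1:j-1}\right)\), where \(\mathbf{X}_{-j}\) is \(\mathbf{X}\) without its \(j\)-th column and \(\mathbf{Z}_{1:j-1}\) holds the knockoffs already generated for variables \(1, \ldots, j-1\).

  2. For a fixed \(K\) (set by ncomp), regress \(\mathbf{x}_j\) on \(K\) principal components (PCs) to obtain the fitted values \(\hat{\mathbf{x}}_j\). The choice of \(K\) involves a tradeoff: a larger \(K\) makes the knockoff more similar to the original variable, leading to a lower type I error but weaker power of the test.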

  3. Compute the residual vector \(\varepsilon_n = \mathbf{x}_j - \hat{\mathbf{x}}_j\).

  4. Permute \(\varepsilon_n\) randomly and denote the permuted residuals as \(\varepsilon_n^*\).

  5. Set \(\mathbf{z}_j = \hat{\mathbf{x}}_j + \varepsilon_n^*\) and append it as a new column to the current knockoff matrix \(\mathbf{Z}_{1:j-1}\).
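
The steps above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the function name pc_knockoff_sketch is hypothetical, and the choices to centre but not scale the PCA input, to use the leading K components, and to include an intercept in the regression are assumptions that the description above does not specify.

```r
# Hypothetical helper illustrating steps 1-5; not the code behind create.pc().
pc_knockoff_sketch <- function(X, ncomp) {
  X <- as.matrix(X)
  stopifnot(is.numeric(X), ncol(X) >= 2, ncomp >= 1)
  n <- nrow(X)
  p <- ncol(X)
  Z <- matrix(NA_real_, nrow = n, ncol = p)

  for (j in seq_len(p)) {
    # Step 1: PCA on the remaining originals plus the knockoffs built so far.
    W <- cbind(X[, -j, drop = FALSE], Z[, seq_len(j - 1), drop = FALSE])
    pcs <- prcomp(W, center = TRUE, scale. = FALSE)$x

    # Step 2: regress x_j on the leading K components (assumed choice),
    # with an intercept column added by hand for lm.fit().
    K <- min(ncomp, ncol(pcs))
    fit <- lm.fit(x = cbind(1, pcs[, seq_len(K), drop = FALSE]), y = X[, j])

    # Steps 3-5: permute the residuals and add them back to the fitted values.
    Z[, j] <- fit$fitted.values + sample(fit$residuals)
  }
  Z
}

set.seed(42)
X <- matrix(rnorm(200), nrow = 50)   # 50 observations, 4 variables
Z <- pc_knockoff_sketch(X, ncomp = 2)
dim(Z)
```

Because the residuals are only permuted, each knockoff column keeps the fitted part of its original variable exactly and reuses the same set of residual values in a different order. In this sketch each column of Z therefore has the same mean as the corresponding column of X.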

References

Jiang, Tao, Yuanyuan Li, and Alison A. Motsinger-Reif. "Knockoff boosted tree for model-free variable selection." Bioinformatics 37.7 (2021): 976-983.

Shen, A., et al. "False discovery rate control in cancer biomarker selection using knockoffs." Cancers 11 (2019): 744.

Examples

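set.seed(1)                          # make the simulated data reproducible
X <- matrix(rnorm(100), nrow = 10)   # 10 observations, 10 variables
Xk <- create.pc(X = X, ncomp = 5)    # list of knockoff matrices (n_ko = 1 by default)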