This function runs the knockoff procedure: it selects the variables relevant for predicting the outcome of interest by computing test statistics that compare each original variable with its knockoff copies.

knockoff.filter(
  X,
  y,
  Xk = NULL,
  statistic = stat.glmnet_coefdiff,
  aggregate = agg_Freq,
  fdr = 0.1,
  offset = 1,
  verbose = FALSE,
  ...
)

Arguments

X

A numeric n-by-p matrix or data frame of predictors.

y

A response vector of length n.

Xk

A list of knockoff copies of X (default: NULL).

statistic

A function to compute test statistics (default: stat.glmnet_coefdiff).

aggregate

A function to aggregate the selections from multiple knockoff copies (default: agg_Freq).

fdr

Target false discovery rate (default: 0.1).

offset

Offset used in the threshold computation, either 0 or 1 (default: 1).

verbose

Logical; if TRUE, prints progress messages during knockoff generation and statistic calculation (default: FALSE).

...

Additional arguments passed to the statistic function.
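
The roles of fdr and offset can be illustrated with the threshold rule from the knockoff filter literature: the threshold is the smallest t for which (offset + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) is at most fdr. The sketch below assumes this standard rule; knockoff_threshold and the example vector W are hypothetical and not part of this package, whose internal computation may differ.

```r
# Hypothetical helper (not part of the package): standard knockoff threshold.
knockoff_threshold <- function(W, fdr = 0.1, offset = 1) {
  # Candidate thresholds are the distinct nonzero magnitudes of W
  ts <- sort(unique(abs(W[W != 0])))
  # Estimated false discovery proportion at each candidate threshold
  ratio <- vapply(ts, function(t) {
    (offset + sum(W <= -t)) / max(1, sum(W >= t))
  }, numeric(1))
  ok <- which(ratio <= fdr)
  # Inf means no candidate meets the target, so nothing is selected
  if (length(ok) > 0) ts[min(ok)] else Inf
}

# Made-up statistics: large positive values suggest a relevant variable
W <- c(5, 4, 3.5, 3, 2.5, 2, 1.8, 1.5, 1.2, 1, -0.5, 0.3, -0.2, 0.1)
thr <- knockoff_threshold(W, fdr = 0.2, offset = 1)
selected <- which(W >= thr)
```

Under this rule, offset = 1 adds one to the numerator, so its threshold is never lower than the one obtained with offset = 0 for the same W and fdr.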

Value

An object of class knockoff.filter, containing:

  • Single Knockoff Case:

    • call: The matched call of the function.

    • W: The test statistics for the original variables.

    • threshold: The computed selection threshold.

    • shat: The indices of variables selected based on the threshold.

  • Multiple Knockoff Case:

    • call: The matched call of the function.

    • shat: The aggregated indices of selected variables.

    • Ws: The matrix of test statistics for multiple knockoff copies.

    • thresholds: A vector of thresholds for each knockoff.

    • shat_list: A list where each element contains the indices of selected variables for a corresponding knockoff copy.

    • shat_mat: A binary matrix where each row indicates the selected variables for a specific knockoff copy (1 for selected, 0 for not selected).
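
In the multiple knockoff case, the relationship between shat_list, shat_mat, and the aggregated shat can be sketched as follows. The values are mock data, and the majority rule used for aggregation is an assumption for illustration only; the rule actually applied by agg_Freq is not specified here.

```r
# Mock selections from 3 knockoff copies over p = 5 variables (illustration only)
shat_list <- list(c(1, 2), c(1, 2, 4), c(1, 3))
p <- 5

# One row per knockoff copy, one column per variable (1 = selected)
shat_mat <- t(sapply(shat_list, function(s) as.integer(seq_len(p) %in% s)))

# Number of copies in which each variable was selected
freq <- colSums(shat_mat)

# Assumed aggregation rule: keep variables selected by more than half the copies
shat <- which(freq > nrow(shat_mat) / 2)
```

The vector freq corresponds to the "Frequency of selected variables" line printed for a fitted object in the examples.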

References

Candes, E., Fan, Y., Janson, L., & Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3), 551-577.

Examples

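# Linear Regression
set.seed(2024)
# n observations, p predictors, k of which are active
n <- 80; p <- 100; k <- 10
Ac <- 1:k        # indices of the active variables
Ic <- (k + 1):p  # indices of the inactive variables
X <- generate_X(n = n, p = p)
y <- generate_y(X, p_nn = k, a = 3)
# Generate 10 knockoff copies of X
Xk <- create.shrink_Gaussian(X = X, n_ko = 10)
res1 <- knockoff.filter(X, y, Xk, statistic = stat.glmnet_coefdiff,
                        offset = 1, fdr = 0.1)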
res1
#> Call:
#> knockoff.filter(X = X, y = y, Xk = Xk, statistic = stat.glmnet_coefdiff, 
#>     fdr = 0.1, offset = 1)
#> 
#> Selected variables:
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> Frequency of selected variables from 10 knockoff copys:
#>   [1] 10 10 10 10 10  8 10 10  9 10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#>  [26]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#>  [51]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#>  [76]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
# Evaluate the selection against the active (Ac) and inactive (Ic) sets
perf_eval(res1$shat, Ac, Ic)
#> [1] 1 0

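# Logistic Regression
# Simulate a binary response from the linear predictor
lp <- generate_lp(X, p_nn = k, a = 3)
pis <- plogis(lp)
Y <- factor(rbinom(n, 1, pis))
# family = 'binomial' is passed through ... to the statistic function
res2 <- knockoff.filter(X, Y, Xk, statistic = stat.glmnet_coefdiff,
                        family = 'binomial', offset = 0, fdr = 0.2)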
res2
#> Call:
#> knockoff.filter(X = X, y = Y, Xk = Xk, statistic = stat.glmnet_coefdiff, 
#>     fdr = 0.2, offset = 0, family = "binomial")
#> 
#> Selected variables:
#> [1]  1  2  4  9 35 77 90
#> 
#> Frequency of selected variables from 10 knockoff copys:
#>   [1]  9 10  2  6  0  0  0  0 10  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
#>  [26]  0  0  0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#>  [51]  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0
#>  [76]  0  8  0  0  0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  1  3  0  0  0
# Evaluate the selection against the active (Ac) and inactive (Ic) sets
perf_eval(res2$shat, Ac, Ic)
#> [1] 0.4000000 0.4285714
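
The two numbers printed by perf_eval in these examples are consistent with power (the share of active variables that were selected) followed by the false discovery proportion (the share of selections that are inactive). That reading is inferred from the outputs, not from the function's documentation. A minimal sketch under that assumption, using a hypothetical power_fdp helper rather than the package's own code:

```r
# Hypothetical re-implementation for illustration; not the package's perf_eval
power_fdp <- function(shat, Ac, Ic) {
  # Share of truly active variables that were selected
  power <- length(intersect(shat, Ac)) / max(1, length(Ac))
  # Share of selected variables that are truly inactive
  fdp <- length(intersect(shat, Ic)) / max(1, length(shat))
  c(power, fdp)
}

# Selection reported for the logistic regression example
result <- power_fdp(c(1, 2, 4, 9, 35, 77, 90), Ac = 1:10, Ic = 11:100)
```

Here 4 of the 10 active variables are selected and 3 of the 7 selections are inactive, which gives 0.4 and 3/7.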