Tests for no adverse shift in a two-sample comparison using isolation scores. The scores are predictions from an extended isolation forest fitted with the isotree package. The prefix od stands for outlier detection, the notion of outlyingness used here.

od_pt(x_train, x_test, R = 1000, num_trees = 500, sub_ratio = 1/2)

Arguments

x_train

Training sample.

x_test

Test sample.

R

The number of permutations. May be ignored.

num_trees

The number of trees in the isolation forest.

sub_ratio

Subsampling ratio for sample splitting. May be ignored.

Value

A named list or object of class outlier.test containing:

  • statistic: observed WAUC statistic

  • seq_mct: sequential Monte Carlo test, if applicable

  • p_value: p-value

  • outlier_scores: outlier scores from training and test set

Details

The suffix pt refers to permutation test. The p-value is estimated from an empirical null distribution built from R permutations, rather than from the asymptotic (theoretical) null distribution of the test statistic, the weighted AUC (WAUC). For speed, the permutation test is implemented as a sequential Monte Carlo test with the simctest package; see Gandy (2009) for details. The permutation approach is recommended for small samples.
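The permutation idea can be illustrated with a minimal base R sketch. It is not the implementation of od_pt: it swaps in the ordinary (unweighted) AUC for the WAUC, squared Mahalanobis distance for isolation scores, and a fixed number of permutations for the sequential Monte Carlo test. The helpers auc, score_stat and perm_test are hypothetical names used only for illustration.

```r
# Ordinary AUC via the rank-sum identity: the probability that a test
# score exceeds a training score. A stand-in for the WAUC, not the WAUC.
auc <- function(train, test) {
  n0 <- length(train)
  n1 <- length(test)
  r <- rank(c(train, test))
  (sum(r[n0 + seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n0 * n1)
}

# Stand-in outlier score: squared Mahalanobis distance from the training
# sample (assumption: od_pt uses isolation forest scores instead).
score_stat <- function(x_train, x_test) {
  center <- colMeans(x_train)
  covar <- stats::cov(x_train)
  auc(
    stats::mahalanobis(x_train, center, covar),
    stats::mahalanobis(x_test, center, covar)
  )
}

# Plain permutation test: shuffle the pooled rows, split them into
# pseudo-training and pseudo-test samples, and recompute the statistic.
perm_test <- function(x_train, x_test, R = 1000) {
  observed <- score_stat(x_train, x_test)
  pooled <- rbind(x_train, x_test)
  n0 <- nrow(x_train)
  null_stats <- replicate(R, {
    idx <- sample(nrow(pooled))
    score_stat(
      pooled[idx[seq_len(n0)], , drop = FALSE],
      pooled[idx[-seq_len(n0)], , drop = FALSE]
    )
  })
  # One-sided p-value with the usual +1 correction.
  p_value <- (1 + sum(null_stats >= observed)) / (R + 1)
  list(statistic = observed, p_value = p_value)
}

set.seed(12345)
res <- perm_test(iris[1:50, 1:4], iris[51:100, 1:4], R = 200)
str(res)
```

A sequential Monte Carlo test aims to reach the same decision while stopping early once the outcome is clear, which is where the speed gain comes from.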

Notes

Isolation forest detects isolated points, instances that are typically out-of-distribution relative to the high-density regions of the data distribution. Any performant method for density-based out-of-distribution detection can replace isolation forest, the default in this implementation.
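As a rough sketch of what swapping the scorer could look like, the code below defines two interchangeable scoring functions that return outlier scores for the training and test samples. Both score_iforest and score_mahalanobis are hypothetical helpers, not functions exported by dsos. The isotree settings (fitting on the training sample only, ndim = 2, nthreads = 1) are assumptions, and Mahalanobis distance is only a simple distance-based stand-in for a density-based detector.

```r
# Hypothetical scorer 1: isolation forest fitted on the training sample.
# ndim = 2 requests multivariate splits (extended isolation forest);
# this setting is an assumption, not necessarily what od_pt uses.
score_iforest <- function(x_train, x_test, num_trees = 500) {
  fit <- isotree::isolation.forest(
    x_train,
    ntrees = num_trees,
    ndim = 2,
    nthreads = 1
  )
  list(train = predict(fit, x_train), test = predict(fit, x_test))
}

# Hypothetical scorer 2: squared Mahalanobis distance from the training
# sample, a base R stand-in with the same return shape.
score_mahalanobis <- function(x_train, x_test) {
  center <- colMeans(x_train)
  covar <- stats::cov(x_train)
  list(
    train = stats::mahalanobis(x_train, center, covar),
    test = stats::mahalanobis(x_test, center, covar)
  )
}

# Use isotree when it is installed, otherwise fall back to base R.
score_fn <- if (requireNamespace("isotree", quietly = TRUE)) {
  score_iforest
} else {
  score_mahalanobis
}

scores <- score_fn(iris[1:50, 1:4], iris[51:100, 1:4])
str(scores)
```

Either scorer yields higher values for more outlying points, which is the property a downstream statistic on outlier scores relies on.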

References

Kamulete, V. M. (2021). Test for non-negligible adverse shifts. arXiv preprint arXiv:2107.02990.

Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008, December). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE.

Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.

Li, J., & Fine, J. P. (2010). Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(4), 673-692.

Examples

# \donttest{
library(dsos)
set.seed(12345)
data(iris)
x_train <- iris[1:50, 1:4] # Training sample: Species == 'setosa'
x_test <- iris[51:100, 1:4] # Test sample: Species == 'versicolor'
iris_test <- od_pt(x_train, x_test) # Can also use: od_ss
str(iris_test)
#> List of 4
#>  $ seq_mct       :Formal class 'sampalgontheflyres' [package "simctest"] with 10 slots
#>   .. ..@ porig  : num [1:39] 1.83e-05 8.58e-05 2.58e-04 6.39e-04 1.40e-03 ...
#>   .. ..@ U      : int [1:500] 3 4 5 5 5 5 6 6 6 6 ...
#>   .. ..@ L      : int [1:500] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>   .. ..@ ind    : num 172
#>   .. ..@ preverr: num [1:2] 0.00032 0.000332
#>   .. ..@ p.value: num 0
#>   .. ..@ steps  : int 173
#>   .. ..@ pos    : num 0
#>   .. ..@ alg    :Formal class 'sampalgonthefly' [package "simctest"] with 1 slot
#>   .. .. .. ..@ internal:<environment: 0x000001d2d43edef8>
#>   .. ..@ gen    :function ()
#>   .. .. ..- attr(*, "srcref")= 'srcref' int [1:8] 6 14 9 3 14 3 6 9
#>   .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x000001d2d753abe0>
#>  $ statistic     : num 0.343
#>  $ p_value       : num 0
#>  $ outlier_scores:List of 2
#>   ..$ train: num [1:50] 0.373 0.397 0.405 0.411 0.383 ...
#>   ..$ test : num [1:50] 0.689 0.691 0.689 0.681 0.688 ...
#>  - attr(*, "class")= chr "outlier.test"
# }