Tests for no adverse shift via residuals in a multivariate two-sample comparison. The scores are residuals derived from the out-of-bag predictions of a random forest fitted with the ranger package. The prefix rd stands for residual diagnostics, the relevant notion of outlier. This test assumes that both training and test sets are labeled.
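To make the idea of out-of-bag residuals concrete, here is a minimal sketch that uses ranger directly. It is not the internal code of rd_pt. The residual definition (one minus the out-of-bag probability assigned to the observed class) and the helper name oob_residuals are assumptions made for illustration only.

```r
library(ranger)

# Hypothetical helper (not part of dsos): out-of-bag residuals for a
# categorical outcome, defined here as 1 - P_oob(observed class).
oob_residuals <- function(data, response_name, num_trees = 500L) {
  fit <- ranger(
    formula = stats::reformulate(".", response = response_name),
    data = data,
    num.trees = num_trees,
    probability = TRUE  # return class probabilities instead of labels
  )
  # For a fitted forest, `predictions` holds the out-of-bag predictions.
  probs <- fit$predictions
  observed <- as.character(data[[response_name]])
  col_idx <- match(observed, colnames(probs))
  # Pick, for each row, the probability of its observed class.
  1 - probs[cbind(seq_len(nrow(data)), col_idx)]
}

set.seed(12345)
res <- oob_residuals(iris, "Species")
summary(res)
```

Larger residuals flag observations that the forest predicts poorly, which is the sense in which they act as outlier scores here.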
rd_pt( x_train, x_test, R = 1000, num_trees = 500L, sub_ratio = 1/2, response_name = "label" )
Argument | Description |
---|---|
x_train | Training sample. |
x_test | Test sample. |
R | The number of permutations. May be ignored. |
num_trees | The number of trees in random forests. |
sub_ratio | Subsampling ratio for sample splitting. May be ignored. |
response_name | The column name of the categorical outcome to predict. |
A named list or object of class outlier.test containing:
statistic: observed WAUC statistic
seq_mct: sequential Monte Carlo test, if applicable
p_value: p-value
outlier_scores: outlier scores from training and test set
The empirical null distribution uses R permutations to estimate the p-value. For speed, this is implemented as a sequential Monte Carlo test with the simctest package. See Gandy (2009) for details. The suffix pt refers to permutation test. The test does not use the asymptotic (theoretical) null distribution for the weighted AUC (WAUC), the test statistic. This is the recommended approach for small samples.
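The permutation logic described above can be sketched in base R. This is a simplified illustration, not the package implementation: it uses the plain (unweighted) AUC as a stand-in for the WAUC, runs all R permutations rather than stopping early as a sequential Monte Carlo test would, and the function names auc and perm_p_value are hypothetical.

```r
# Plain AUC via the Mann-Whitney rank formula (a stand-in for the WAUC).
# Values above 0.5 mean test scores tend to exceed training scores.
auc <- function(scores, is_test) {
  n1 <- sum(is_test)
  n0 <- sum(!is_test)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[is_test]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical helper: one-sided permutation p-value from R label shuffles.
perm_p_value <- function(train_scores, test_scores, R = 1000) {
  scores <- c(train_scores, test_scores)
  is_test <- rep(c(FALSE, TRUE),
                 c(length(train_scores), length(test_scores)))
  observed <- auc(scores, is_test)
  # Shuffling the sample labels simulates the null of no shift.
  permuted <- replicate(R, auc(scores, sample(is_test)))
  # The +1 terms count the observed statistic as one of the permutations.
  p_value <- (1 + sum(permuted >= observed)) / (R + 1)
  list(statistic = observed, p_value = p_value)
}

set.seed(1)
res <- perm_p_value(rnorm(100), rnorm(50, mean = 1))
str(res)
```

Because the null distribution is built from the data at hand, no large-sample approximation is needed, which is why a permutation approach suits small samples.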
Residuals traditionally underpin diagnostic (misspecification) tests in supervised learning. For a contemporary example of this approach that also uses machine learning, see Janková et al. (2020) and references therein.
Kamulete, V. M. (2021). Test for non-negligible adverse shifts. arXiv preprint arXiv:2107.02990.
Janková, J., Shah, R. D., Bühlmann, P., & Samworth, R. J. (2020). Goodness-of-fit testing in high dimensional generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3), 773-795.
Li, J., & Fine, J. P. (2010). Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(4), 673-692.
Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.
# \donttest{
library(dsos)
set.seed(12345)
data(iris)
idx <- sample(nrow(iris), 2 / 3 * nrow(iris))
xy_train <- iris[idx, ]
xy_test <- iris[-idx, ]
iris_test <- rd_pt(xy_train, xy_test, response_name = "Species")
str(iris_test)
#> List of 4
#>  $ seq_mct :Formal class 'sampalgontheflyres' [package "simctest"] with 10 slots
#>   .. ..@ porig : num [1:55] 8.58e-06 3.24e-05 7.86e-05 1.60e-04 2.99e-04 ...
#>   .. ..@ U : int [1:500] 47 47 47 47 47 48 48 48 48 48 ...
#>   .. ..@ L : int [1:500] 7 7 7 8 8 8 8 8 8 8 ...
#>   .. ..@ ind : num 499
#>   .. ..@ preverr: num [1:2] 0.000494 0.000499
#>   .. ..@ p.value: num NA
#>   .. ..@ steps : int 1000
#>   .. ..@ pos : num 32
#>   .. ..@ alg :Formal class 'sampalgonthefly' [package "simctest"] with 1 slot
#>   .. .. .. ..@ internal:<environment: 0x000001d2d7fee3d8>
#>   .. ..@ gen :function ()
#>   .. .. ..- attr(*, "srcref")= 'srcref' int [1:8] 6 14 9 3 14 3 6 9
#>   .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x000001d2d753abe0>
#>  $ statistic : num 0.349
#>  $ p_value : num 0.032
#>  $ outlier_scores:List of 2
#>   ..$ train: num [1:100] 0.01593 0.00299 0.00538 0 0 ...
#>   ..$ test : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
#>  - attr(*, "class")= chr "outlier.test"
# }