Test for no adverse shift via class probabilities for two-sample comparison. The scores are out-of-bag predictions from random forests fit with the ranger package. The prefix cp stands for class probability: the estimated probability that an instance belongs to the training or test set. The probability of belonging to the test set is the relevant notion of outlyingness.
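
To make the scoring step concrete, here is a minimal sketch, not the package's internal code, of how out-of-bag class probabilities from ranger can serve as outlier scores; the iris subsets and sample sizes are assumptions for the example.

library(ranger)
set.seed(12345)
data(iris)
x_all <- iris[1:100, 1:4]
y <- factor(rep(c("train", "test"), each = 50))
# Probability forest: out-of-bag predictions act as held-out scores
fit <- ranger(x = x_all, y = y, probability = TRUE, num.trees = 500)
# Out-of-bag probability of belonging to the test set = outlier score
scores <- fit$predictions[, "test"]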

cp_ss(x_train, x_test, sub_ratio = 1/2, R = 1000, num_trees = 500)

Arguments

x_train

Training sample.

x_test

Test sample.

sub_ratio

Subsampling ratio for sample splitting. May be left at its default of 1/2.

R

The number of permutations, used by permutation-based variants. Because this method relies on the asymptotic null distribution, it may be left at its default.

num_trees

The number of trees in random forests.

Value

A named list of class outlier.test containing the following components (a brief access sketch follows the list):

  • statistic: observed WAUC statistic

  • seq_mct: sequential Monte Carlo test, if applicable

  • p_value: p-value

  • outlier_scores: outlier scores from training and test set
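
A short sketch of accessing these components, assuming dsos is installed and reusing the iris subsets from the Examples:

library(dsos)
set.seed(12345)
data(iris)
res <- cp_ss(iris[1:50, 1:4], iris[51:100, 1:4])
res$statistic            # observed WAUC
res$p_value              # p-value from the asymptotic null
str(res$outlier_scores)  # outlier scores for the training and test sets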

Details

This approach uses sample splitting to compute the p-value for inference. sub_ratio splits each sample into two parts: with a sub_ratio of 1/2, one half goes to estimation (calibration) and the other half to inference. In other words, it sacrifices some predictive accuracy for inferential robustness, as in Rinaldo et al. (2019). The suffix ss refers to sample splitting. Sample splitting relies on the asymptotic null distribution of the weighted AUC (WAUC), the test statistic; Li & Fine (2010) derive this null distribution.
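
As a concrete illustration of the splitting scheme, consider the following sketch; the helper split_half is hypothetical, not the package's internal implementation.

data(iris)
x_train <- iris[1:50, 1:4]
x_test <- iris[51:100, 1:4]
# Hypothetical helper: divide a sample into estimation and inference parts
split_half <- function(x, sub_ratio = 1/2) {
  idx <- sample.int(nrow(x), size = floor(sub_ratio * nrow(x)))
  list(estimation = x[idx, , drop = FALSE],
       inference = x[-idx, , drop = FALSE])
}
train_parts <- split_half(x_train)
test_parts <- split_half(x_test)
# Fit the classifier on the estimation parts; score the inference parts.
# The WAUC of those held-out scores yields the p-value.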

Notes

Please see the references for the classifier two-sample test, the inspiration behind this approach. Note that Clémençon et al. (2009) use both sample splitting for inference and the AUC, rather than the WAUC. Most supervised methods for binary classification can replace random forests, the default in this implementation.
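
To illustrate the last point, here is a hedged sketch that swaps in logistic regression (glm) as the scorer; the substitution is for illustration only and is not part of the dsos API.

data(iris)
x_train <- iris[1:50, 1:4]
x_test <- iris[51:100, 1:4]
d <- rbind(x_train, x_test)
d$is_test <- rep(c(0, 1), c(nrow(x_train), nrow(x_test)))
# Any probabilistic binary classifier can produce the scores; setosa and
# versicolor separate perfectly here, so expect a separation warning
fit <- glm(is_test ~ ., data = d, family = binomial())
# In-sample scores for brevity; in practice, score held-out data via
# sample splitting as described in Details
scores <- predict(fit, type = "response")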

References

Kamulete, V. M. (2021). Test for non-negligible adverse shifts. arXiv preprint arXiv:2107.02990.

Clémençon, S., Depecker, M., & Vayatis, N. (2009, December). AUC optimization and the two-sample problem. In Proceedings of the 22nd International Conference on Neural Information Processing Systems (pp. 360-368).

Lopez-Paz, D., & Oquab, M. (2016). Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545.

Friedman, J. (2004). On multivariate goodness-of-fit and two-sample testing.

Gandy, A. (2009). Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488), 1504-1511.

Li, J., & Fine, J. P. (2010). Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(4), 673-692.

Rinaldo, A., Wasserman, L., & G'Sell, M. (2019). Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. Annals of Statistics, 47(6), 3438-3469.

See also

Other classifiers: cp_at(), cp_pt()

Examples

# \donttest{
library(dsos)
set.seed(12345)
data(iris)
x_train <- iris[1:50, 1:4]   # Training sample: Species == 'setosa'
x_test <- iris[51:100, 1:4]  # Test sample: Species == 'versicolor'
iris_test <- cp_at(x_train, x_test) # Can also use: cp_ss and cp_pt
str(iris_test)
#> List of 3
#>  $ statistic     : num 1
#>  $ p_value       : num 0
#>  $ outlier_scores:List of 2
#>   ..$ train: num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ test : num [1:50] 1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "class")= chr "outlier.test"
# }