Prior Shift Using the Ratio Estimator

Abstract

Several machine learning applications use classifiers as a way of quantifying the prevalence of positive class labels in a target dataset, a task named quantification. For instance, a naive way of determining what proportion of people like a given product with no labeled reviews is to (i) train a classifier based on the Google Shopping reviews to predict whether a user likes a product given its review, and then (ii) apply this classifier to Facebook/Google+ posts about that product. It is well known that such a two-step approach, named Classify and Count, fails because of dataset shift, and thus several improvements have recently been proposed under an assumption named prior shift. Unfortunately, these methods only explore the relationship between the covariates and the response via classifiers. Moreover, the literature lacks the theoretical foundation needed to improve these techniques. We propose a new family of estimators, named the Ratio Estimator, which is able to explore the relationship between the covariates and the response using any function g: X → R and not only classifiers. We show that for some choices of g, our estimator matches standard estimators used in the literature. We also explore alternative ways of constructing functions g that lead to estimators with good performance, and compare them using real datasets. Finally, we provide a theoretical analysis of the method.
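The contrast between Classify and Count and a ratio-style correction can be illustrated with a minimal sketch. It is not taken from the paper: it assumes that, under prior shift, the target mean of g(X) is a mixture of the two class-conditional means of g(X) estimated on the labeled data, and solves that mixture for the prevalence. The function names (`classify_and_count`, `ratio_estimator`) and the clipping to [0, 1] are illustrative choices.

```python
import numpy as np


def classify_and_count(pred_target):
    """Naive estimate: fraction of target points predicted positive."""
    return float(np.mean(pred_target))


def ratio_estimator(g_labeled, y_labeled, g_target):
    """Sketch of a ratio-style prevalence estimate (assumed form).

    Under prior shift, the distribution of X given Y is the same in both
    datasets, so  mean_target(g) = theta * mu1 + (1 - theta) * mu0,
    where mu1 and mu0 are the class-conditional means of g(X).
    Solving for theta gives the estimate below.
    """
    g_labeled = np.asarray(g_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled)
    g_target = np.asarray(g_target, dtype=float)

    mu1 = g_labeled[y_labeled == 1].mean()  # mean of g among labeled positives
    mu0 = g_labeled[y_labeled == 0].mean()  # mean of g among labeled negatives
    if np.isclose(mu1, mu0):
        raise ValueError("g does not separate the classes; estimate undefined")

    theta = (g_target.mean() - mu0) / (mu1 - mu0)
    # Clipping is an illustrative choice to keep the estimate a valid proportion.
    return float(np.clip(theta, 0.0, 1.0))


# Toy data: g is the hard output of an imperfect classifier.
y_labeled = np.array([0, 0, 0, 0, 1, 1, 1, 1])
g_labeled = np.array([0, 0, 0, 1, 1, 1, 1, 0])
g_target = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

print("Classify and Count:", classify_and_count(g_target))
print("Ratio-style estimate:", ratio_estimator(g_labeled, y_labeled, g_target))
```

In this sketch g is a hard classifier output, but any real-valued score could be passed in its place, provided its mean differs between the two classes in the labeled sample.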

Publication
In Proceedings of Bayesian Inference and Maximum Entropy Methods in Science and Engineering
Rafael B. Stern
Professor of Statistics

I am an Assistant Professor at the University of São Paulo. I have a B.A. in Statistics from the University of São Paulo, a B.A. in Law from Pontifícia Universidade Católica in São Paulo, and a Ph.D. in Statistics from Carnegie Mellon University. I am currently a member of the Scientific Council of the Brazilian Association of Jurimetrics, an associate investigator at NeuroMat and a member of the Order of Attorneys of Brazil.