|
Active Learning
The term active learning applies to a wide range of situations in which a learner is able to exert some control over its source of data. For instance, when fitting a regression function, the learner may itself supply a set of data points at which to measure response values, in the hope of reducing the variance of its estimate. Such problems have been studied for many decades under the rubric of experimental design [C72,F72]. More recently, there has been substantial interest within the machine learning community in the specific task of actively learning binary classifiers. This task presents several fundamental statistical and algorithmic challenges, and an understanding of its mathematical underpinnings is only gradually emerging.
Learning from labeled and unlabeled data
In the machine learning literature, the task of learning a classifier has traditionally been studied in the framework of supervised learning. This paradigm assumes that there is a training set consisting of data points x (from some set X) and their labels y (from some set Y), and the goal is to learn a function f: X → Y that will accurately predict the labels of data points arising in the future. Over the past fifty years, tremendous progress has been made in resolving many of the basic questions surrounding this model, such as ``how many training points are needed to learn an accurate classifier?''
Although this framework is now fairly well understood, it is a poor fit for many modern learning tasks because of its assumption that all training points automatically come labeled. In practice, it is frequently the case that the raw, abundant, easily-obtained form of data is unlabeled, whereas labels must be explicitly procured and are expensive. In such situations, the reality is that the learner starts with a large pool of unlabeled points and must then strategically decide which ones it wants labeled: how best to spend its limited budget.
Example: Speech recognition. When building a speech recognizer, the unlabeled training data consists of raw speech samples, which are very easy to collect: just walk around with a microphone. For all practical purposes, an unlimited quantity of such samples can be obtained. On the other hand, the ``label'' for each speech sample is a segmentation into its constituent phonemes, and producing even one such label requires substantial human time and attention. Over the past decades, research labs and the government have expended an enormous amount of money, time, and effort on creating labeled datasets of English speech. This investment has paid off, but our ambitions are inevitably moving past what these datasets can provide: we would now like, for instance, to create recognizers for other languages, or for English in specific contexts. Is there some way to avoid more painstaking years of data labeling, to somehow leverage the easy availability of raw speech so as to significantly reduce the number of labels needed? This is the hope of active learning.
Some early results on active learning were in the membership query model, where the data is assumed to be separable (that is, some hypothesis perfectly classifies all points) and the learner is allowed to query the label of any point in the input space X (rather than being constrained to a prespecified unlabeled set), with the goal of eventually returning the perfect hypothesis h*. There is a significant body of beautiful theoretical work in this model [A02], but early experiments ran into some telling difficulties. One study [BL92] found that when training a neural network for handwritten digit recognition, the queries synthesized by the learner were such bizarre and unnatural images that they were impossible for a human to classify. In such contexts, the membership query model is of limited practical value; nonetheless, many of the insights obtained from this model carry over to other settings [H07a].
We will fix as our standard model one in which the learner is given a source of unlabeled data, rather than being able to generate these points himself. Each point has an associated label, but the label is initially hidden, and there is a cost for revealing it. The hope is that an accurate classifier can be found by querying just a few labels, much fewer than would be required by regular supervised learning.
How can the learner decide which labels to probe? One option is to select the query points at random, but it is not hard to show that this yields the same label complexity as supervised learning. A better idea is to choose the query points adaptively: for instance, start by querying some random data points to get a rough sense of where the decision boundary lies, and then gradually refine the estimate of the boundary by specifically querying points in its immediate vicinity. In other words, ask for the labels of data points whose particular positioning makes them especially informative. Such strategies certainly sound good, but can they be fleshed out into practical algorithms? And if so, do these algorithms work well in the sense of producing good classifiers with fewer labels than would be required by supervised learning?
On account of the enormous practical importance of active learning, there are a wide range of algorithms and techniques already available, most of which resemble the aggressive, adaptive sampling strategy just outlined, and many of which show promise in experimental studies. However, a big problem with this kind of sampling is that very quickly, the set of labeled points no longer reflects the underlying data distribution. This makes it hard to show that the classifiers learned have good statistical properties (for instance, that they converge to an optimal classifier in the limit of infinitely many labels). There has recently been considerable progress on developing active learning algorithms that have good statistical properties as well as good bounds on label complexity; see the references below.
References
[A02] D. Angluin. Queries revisited. Proceedings of the Twelfth International Conference on Algorithmic Learning Theory, 12--31, 2001.
[BBL06] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. International Conference on Machine Learning, 2006.
[BDL09] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. International Conference on Machine Learning, 2009.
[BL92] E.B. Baum and K. Lang. Query learning can work poorly when a human oracle is used. International Joint Conference on Neural Networks, 1992.
[CGZ06] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7:1205-1230, 2006.
[CAL94] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201--221, 1994.
[C72] H. Chernoff. Sequential analysis and optimal design. CBMS-NSF Regional Conference Series in Applied Mathematics 8. SIAM, 1972.
[D05] S. Dasgupta. Coarse sample complexity bounds for active learning. Advances in Neural Information Processing Systems, 2005.
[DHM07] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. Advances in Neural Information Processing Systems, 2007.
[F72] V.V. Fedorov. Theory of optimal experiments. Translated by W.J.~Studden and E.M.~Klimko. Academic Press, 1972.
[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning Journal, 28:133--168, 1997.
[GNT05] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committeee made real. Advances in Neural Information Processing Systems, 2005.
[H07a] S. Hanneke. Teaching dimension and the complexity of active learning. Conference on Learning Theory, 2007.
[H07b] S. Hanneke. A bound on the label complexity of agnostic active learning. International Conference on Machine Learning, 2007.
|
Yoav Freund
Active Learning for Visual Object Recognition.
Yotam Abramson and Yoav Freund. / UCSD Technical report
http://cseweb.ucsd.edu/~yfreund/papers/sevillePaper.pdf