The Machine Learning Forum

CB Login

CB Workflows

CB Online

None
Active learning PDF Print E-mail
User Rating: / 0
PoorBest 
Written by Sanjoy Dasgupta   
Sunday, 12 July 2009 16:36

Active Learning

The term active learning applies to a wide range of situations in which a learner is able to exert some control over its source of data. For instance, when fitting a regression function, the learner may itself supply a set of data points at which to measure response values, in the hope of reducing the variance of its estimate. Such problems have been studied for many decades under the rubric of experimental design [C72,F72]. More recently, there has been substantial interest within the machine learning community in the specific task of actively learning binary classifiers. This task presents several fundamental statistical and algorithmic challenges, and an understanding of its mathematical underpinnings is only gradually emerging.

 

Learning from labeled and unlabeled data

In the machine learning literature, the task of learning a classifier has traditionally been studied in the framework of supervised learning. This paradigm assumes that there is a training set consisting of data points x (from some set X) and their labels y (from some set Y), and the goal is to learn a function f: X → Y that will accurately predict the labels of data points arising in the future. Over the past fifty years, tremendous progress has been made in resolving many of the basic questions surrounding this model, such as ``how many training points are needed to learn an accurate classifier?''

Although this framework is now fairly well understood, it is a poor fit for many modern learning tasks because of its assumption that all training points automatically come labeled. In practice, it is frequently the case that the raw, abundant, easily-obtained form of data is unlabeled, whereas labels must be explicitly procured and are expensive. In such situations, the reality is that the learner starts with a large pool of unlabeled points and must then strategically decide which ones it wants labeled: how best to spend its limited budget.

Example: Speech recognition. When building a speech recognizer, the unlabeled training data consists of raw speech samples, which are very easy to collect: just walk around with a microphone. For all practical purposes, an unlimited quantity of such samples can be obtained. On the other hand, the ``label'' for each speech sample is a segmentation into its constituent phonemes, and producing even one such label requires substantial human time and attention. Over the past decades, research labs and the government have expended an enormous amount of money, time, and effort on creating labeled datasets of English speech. This investment has paid off, but our ambitions are inevitably moving past what these datasets can provide: we would now like, for instance, to create recognizers for other languages, or for English in specific contexts. Is there some way to avoid more painstaking years of data labeling, to somehow leverage the easy availability of raw speech so as to significantly reduce the number of labels needed? This is the hope of active learning.

Some early results on active learning were in the membership query model, where the data is assumed to be separable (that is, some hypothesis perfectly classifies all points) and the learner is allowed to query the label of any point in the input space X (rather than being constrained to a prespecified unlabeled set), with the goal of eventually returning the perfect hypothesis h*. There is a significant body of beautiful theoretical work in this model [A02], but early experiments ran into some telling difficulties. One study [BL92] found that when training a neural network for handwritten digit recognition, the queries synthesized by the learner were such bizarre and unnatural images that they were impossible for a human to classify. In such contexts, the membership query model is of limited practical value; nonetheless, many of the insights obtained from this model carry over to other settings [H07a].

We will fix as our standard model one in which the learner is given a source of unlabeled data, rather than being able to generate these points himself. Each point has an associated label, but the label is initially hidden, and there is a cost for revealing it. The hope is that an accurate classifier can be found by querying just a few labels, much fewer than would be required by regular supervised learning.

How can the learner decide which labels to probe? One option is to select the query points at random, but it is not hard to show that this yields the same label complexity as supervised learning. A better idea is to choose the query points adaptively: for instance, start by querying some random data points to get a rough sense of where the decision boundary lies, and then gradually refine the estimate of the boundary by specifically querying points in its immediate vicinity. In other words, ask for the labels of data points whose particular positioning makes them especially informative. Such strategies certainly sound good, but can they be fleshed out into practical algorithms? And if so, do these algorithms work well in the sense of producing good classifiers with fewer labels than would be required by supervised learning?

On account of the enormous practical importance of active learning, there are a wide range of algorithms and techniques already available, most of which resemble the aggressive, adaptive sampling strategy just outlined, and many of which show promise in experimental studies. However, a big problem with this kind of sampling is that very quickly, the set of labeled points no longer reflects the underlying data distribution. This makes it hard to show that the classifiers learned have good statistical properties (for instance, that they converge to an optimal classifier in the limit of infinitely many labels). There has recently been considerable progress on developing active learning algorithms that have good statistical properties as well as good bounds on label complexity; see the references below.


References

[A02] D. Angluin. Queries revisited. Proceedings of the Twelfth International Conference on Algorithmic Learning Theory, 12--31, 2001.

[BBL06] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. International Conference on Machine Learning, 2006.

[BDL09] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. International Conference on Machine Learning, 2009.

[BL92] E.B. Baum and K. Lang. Query learning can work poorly when a human oracle is used. International Joint Conference on Neural Networks, 1992.

[CGZ06] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7:1205-1230, 2006.

[CAL94] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201--221, 1994.

[C72] H. Chernoff. Sequential analysis and optimal design. CBMS-NSF Regional Conference Series in Applied Mathematics 8. SIAM, 1972.

[D05] S. Dasgupta. Coarse sample complexity bounds for active learning. Advances in Neural Information Processing Systems, 2005.

[DHM07] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. Advances in Neural Information Processing Systems, 2007.

[F72] V.V. Fedorov. Theory of optimal experiments. Translated by W.J.~Studden and E.M.~Klimko. Academic Press, 1972.

[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning Journal, 28:133--168, 1997.

 

[GNT05] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committeee made real. Advances in Neural Information Processing Systems, 2005.

[H07a] S. Hanneke. Teaching dimension and the complexity of active learning. Conference on Learning Theory, 2007.

[H07b] S. Hanneke. A bound on the label complexity of agnostic active learning. International Conference on Machine Learning, 2007.

 

Comments (3)
  • yoavfreund  - An application of active learning to pedestrian re
    A few years ago I did some work with Yotam Abramson (Now in the Technion) on using active learning to build a pedestrian detector. The work was never published, but you can read the tech report, the link is below.

    Yoav Freund

    Active Learning for Visual Object Recognition.
    Yotam Abramson and Yoav Freund. / UCSD Technical report
    http://cseweb.ucsd.edu/~yfreund/papers/sevillePaper.pdf
  • jayant
    While reading this article one point which immediately struck my mind was that even though we are able to come up with good features in most of the vision problems, in the traditional setting of Supervised learning we are not able to use its power fully. I mean we only use features to learn the parameters of a learning algorithm, but potentially either having apriori knowledge or learning from very few labeled data we can automatically select/discard many training points. There are many things to exploit like redundancy, selection of boundary points etc. I think this topic will become central problem in machine learning and vision soon, as day by day the size of unlabeled data is growing exponentially and the variations is also increasing. Also this mimics human learning system whose "parameters" evolves with time.
    Currently I am also working on some aspects of this topic and quite interested in it.
    Thanks for this nice overview.
  • skinnymaggie  - silver tiffany
    All trades, arts, and handiworks have gained by division of labour, namely, when, instead of one man doing everything, each confines himself to a certain kind of work distinct from others in the treatment it requires,Silver Tiffany so as to be able to perform it with greater facility and in the greatest perfection. Tiffany JewelryWhere the different kinds of work are not distinguished and divided, where everyone is a jack-of-all-trades, Tiffany Braceletsthere manufactures remain still in the greatest barbarism. Tiffany NecklacesIt might deserve to be considered whether pure philosophy in all its parts Tiffany Earringsdoes not require a man specially devoted to it, and whether it would not be better for the whole business of science if those who, to please the tastes of the public, are wont to blend the rational and empirical elements together, mixed in all sorts of proportions unknown to themselves, and who call themselves independent thinkers, giving the name of minute philosophers...
Only registered users can write comments!

!joomlacomment 4.0 Copyright (C) 2009 Compojoom.com . All rights reserved."

Last Updated on Tuesday, 14 July 2009 20:51