The Machine Learning Forum

CB Login

CB Workflows

CB Online

None
Information Theory and Machine Learning PDF Print E-mail
Written by Maxim Raginsky   
Thursday, 30 July 2009 17:46

Information Theory and Machine Learning

 

There is a fundamental conceptual affinity between information theory and machine learning. Both are concerned with exploiting regularities and patterns in data sources to accomplish certain tasks in the presence of uncertainty and noise. In the case of information theory, the tasks of interest involve reliable representation, transmission and storage of information; in the case of learning, we are interested in making accurate predictions about the future on the basis of previously seen data.

 

Some connections between information theory and statistical inference are well known. For instance, if Z1, Z2, ..., Zn are i.i.d. samples whose common distribution is known to be either P or Q, then the probability of error of an optimal hypothesis test that decides between these two alternatives (when P is the actual distribution) behaves approximately like exp[-n D(P||Q)], where D(.||.) is the relative entropy (or the Kullback-Leibler divergence). This is known as Stein's lemma. In particular, if each Zi is a pair (Xi,Yi), where Xi and Yi have a given joint distribution PXY, then the probability of error of an optimal hypothesis test that decides whether the n-tuple (X1,Y1), ..., (Xn,Yn) is a sample from PXY or from the product distribution PXPY is exp[-n I(X;Y)], where I(X;Y) is the mutual information between X and Y. For reference on all things information theory I recommend Cover and Thomas [CT06].

There is a vast and varied body of research that has contributed to a deep symbiotic relationship between machine learning and information theory. For example, information-theoretic methods have shed light on the training requirements and on the optimal performance of learning algorithms, while learning-theoretic methods have been exploited fruitfully in the context of adaptive (universal) methods for data compression. There are still many new directions to explore at the interface between machine learning and information theory; I will return to this topic at the end.

In this overview, I will survey some of this work. Needless to say, the choice of topics and papers is by no means complete, and reflects my own preferences and tastes. Nonetheless, I have tried to do justice to all the major research directions and strands in this area.

Sequential prediction

One problem where information-theoretic considerations play a prominent role is sequential prediction. The forecaster observes elements of a sequence one at a time, and at each time step must assign beliefs (or likelihoods) to the next, as yet unseen, symbol given the symbols observed so far.  When the new symbol arrives, the forecaster incurs a loss equal to the negative log-likelihood assigned by his model to this symbol. This problem appears in such contexts as online density estimation, sequential investment and adaptive data compression. A natural information-theoretic interpretation of the log-loss criterion is in terms of optimal codelengths.

This prediction problem comes in two basic flavors. For instance, the forecaster may know a priori that the observed sequence is drawn probabilistically from a class of model distributions; alternatively, it may be an individual (e.g., deterministic) sequence, and the forecaster merely wants to keep making good predictions of the next symbol using a class of models of his choice. A good survey of both approaches, with references to original and related work, is by Merhav and Feder [MF98]; the recent text by Cesa-Bianchi and Lugosi [CL06] is also recommended. In the probabilistic setting it turns out that the coding interpretation of the optimal log-loss is related to the information capacity of a channel whose input space is the space of model distributions and whose output space is the space of observed sequences. This is explored in depth in an excellent paper by Haussler and Opper [HO97]. Recent work by Seeger, Kakade and Foster [SKF08] applies these ideas to sequential prediction using Gaussian processes.

The Minimum Description Length (MDL) principle

 

The sequential prediction problem discussed above has strong ties to the Minimum Description Length (MDL) principle of Rissanen (see, e.g., [Ris86], the survey article of Barron, Rissanen and Yu [BRY98], or the recent book by Grünwald [Grü06]). In particular, the MDL principle serves as a basis for complexity-regularized density estimation [BC91]; in fact, recent work by Barron et al. [BHLL08] has shown that sparse estimation techiques based on l1 regularization (see, e.g., [BTW07]) can be viewed as a special case of MDL. In other words, the l1 penalties that serve as convex surrogates for the combinatorial l0 (subset selction) penalties actually arise from information-theoretic penalties that have a natural codelength interpretation.

Learning with exponential families

 

One of the earliest applications of information theory to statistical inference was to fitting probability models to moment constraints using a maximum-entropy criterion (or minimum KL divergence criterion). This line of research, which dates back to the seminal work of Kullback [Kul68], gives an information-theoretic justification to exponential families of probability distributions. Exponential families play a prominent role in Bayesian machine learning with graphical models, and information-theoretic considerations have motivated a wide variety of techniques for approximate inference and parameter learning in graphical models (see the comprehensive survey of Wainwright and Jordan [WJ08]). In the context of supervised learning, Gupta, Gray and Olshen [GGO06] developed a new approach to kernel classification and regression using a maximum entropy criterion coupled with linear interpolation.

Information-theoretic lower bounds on learning algorithms

 

Let me now pass from density estimation to supervised learning, or learning from examples. In this setting, information-theoretic methods have lead to crisp and general techniques for proving lower bounds on the sample complexity of learning algorithms. For instance, Gentile and Helmbold [GH01] use such basic tools from information theory as the data processing theorem and Fano's inequality to establish lower bounds on the sample complexity of concept learning in a variety of scenarios, such as adversarial (malicious) noise, classification with a noisy teacher and active learning with membership queries. More recently, Zhang [Zha06] has developed an information-theoretic framework for deriving tight upper and lower bounds on the risk of learning algorithms in a very general setting.

Learning-theoretic results for vector quantizers

 

Parallel to the work that uses information-theoretic methods to solve problems motivated by machine learning, there has been a lot of research effort in the opposite direction: using learning-theoretic methods to answer questions motivated by information theory.

Most prominently, this has been the case in the field of lossy source coding and design of vector quantizers. A d-dimensional vector quantizer is a mapping from d-dimensional real vectors into a finite set of vectors (called a codebook). If the input of the quantizer is drawn at random, from some distribution P, we can talk about the expected distortion, i.e., the mean squared error incurred by the quantization mapping. When the source distribution P is unknown, we must resort to empirical quantizer design based on training data. This is closely related to data clustering; in particular, one popular method for empirical quantizer design, known in the data compression community as the Max-Lloyd algorithm, is precisely the k-means algorithm. The paper by Linder, Lugosi and Zeger [LLZ94] uses Vapnik-Chervonenkis concentration inequalities to derive bounds on the sample complexity of empirically designed vector quantizers. The survey of Linder [Lin02] contains more information on learning-theoretic methods in vector quantization. See also the papers [Rag08,Rag09] that draw further connections between learning theory (in particular, Vapnik-Chervonenkis classes) and universal lossy source codes.

Learning under communication constraints

 

Another area where information theory and machine learning intersect is learning under communication constraints. For example, if the source of training data is geographically separated from the learning agent, or if the amount of memory to store the training data is limited, it may be necessary to first compress the training sequence before handing it over to the learning agent. Such compression inevitably leads to loss of information, which may adversely impact the performance of the learning agent. Predd, Kulkarni and Poor [PKP06] considered a remote sensing scenario, where there are n sensors and the ith training sample (Xi,Yi) is held by the ith sensor. The learning agent wishes to predict the label Y for a new feature X, and can broadcast X to all the sensors noiselessly. However, each sensor must respond with a short binary message. They show how this can be accomplished using log2(3) bits per sensor, with consistency guarantees similar to those afforded by Stone's theorem.

Westover and O'Sullivan [WO08] consider learning algorithms with bounded memory and give information-theoretic upper bounds on the number of patterns (or classes) that can be reliably discriminated by such algorithms. Their formulation, inspired by biological models of learning, establishes interesting connections between machine learning and distributed source coding.

Another variant of the problem of learning under communication constraints was recently introduced by Raginsky [Rag09a]. Namely, the training samples are held by the same entity, but the learning agent can only see R bits per sample. The question is: what are the achievable values of the risk of a learning algorithm operating on training data compressed in this way? This work ties together an information-theoretic formulation reminiscent of Shannon's rate-distortion theory with the theory of empirical processes.

Information Bottleneck method and related problems

 

Finally, I will mention the Information Bottleneck method (IB) introduced by Tishby, Pereira and Bialek [TPB99]. The IB is proposed as a principled information-theoretic answer to the problem of feature selection: given a learning task, which features of the observed phenomenon are relevant for predicting a label, and which are not?  An operational interpretation of the IB method in terms of a rate-distortion problem that involves Bregman divergences is given in the paper by Banerjee et al. [BMDG05]. Follow-up work by Harremoës and Tishby [HT07] gives another operational interpretation of IB in terms of distortion measures of the form d(x,x') = D(P(.|x) || P(.|x')), where x and x' are a pair of signals and P(.|x) is a conditional distribution of the label Y given X=x. See also the paper by Lazebnik and Raginsky [LR09] that addresses the problem of using labeled training examples to learn nearest-neighbor quantizers for continuous-valued features that preserve predictively relevant information about the class label of a feature.

Another work that should be mentioned in this connection is an information-theoretic approach to multi-view learning by Sridharan and Kakade [SK08]. They consider the setting where the feature X is partitioned into two components (or views) X1 and X2, and either of the views is by itself sufficient to predict the label Y. This is formalized by requiring that the conditional mutual informations I(Y; X1 | X2) and I(Y; X2 | X1) are both bounded from above by some small number. Among other things, Sridharan and Kakade comment on the connection between multi-view learning and the IB method.

Future directions

 

I have attempted to paint a comprehensive picture of the vast research literature at the interface between information theory and machine learning. I will close by listing two promising directions for future research:

  • Information-theoretic lower bounds for active learning -- Gentile and Helmbold [GH01] gave lower bounds for active learning using membership queries. It would be interesting to explore other active learning models (such as agnostic active learning) from an information-theoretic perspective. In particular, information-theoretic results on communication and inference with feedback can be brought to bear on this problem.

  • Information-theoretic interpretation of the VC dimension and related quantities -- The role of such combinatorial concepts as the Vapnik-Chervonenkis (VC) dimension and related quantities, such as Rademacher complexities, in controlling the generalization capabilities of learning algorithms is well understood from a geometric point of view. It would be interesting to understand these concepts from an information-theoretic point of view as well. For example, the VC dimension could conceivably be connected to some notion of "compressibility" of concept classes relative to stochastic sources of training data. The paper of Blumer et al. [BEHW89] has some results that could be useful.
There are certainly many other exciting research avenues I haven't thought of, and I am looking forward to your comments.

References

[BC91] A.R. Barron and T.M. Cover, Minimum complexity density estimation, IEEE Transactions on Information Theory, vol. 37, no. 4, pp. 1034-1054, 1991
[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler and M.K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, Journal of the ACM, vol. 36, no. 4, pp. 929-965, 1989
[BHLL08] A.R. Barron, C. Huang, J.Q. Li, and X. Luo, The MDL principle, penalized likelihood, and statistical risk, in Festschrift for Jorma Rissanen, 2008
[BMDG05] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh, Clustering with Bregman divergences, Journal of Machine Learning Research, vol. 6, pp. 1705-1749, 2005
[BRY96] A. Barron, J. Rissanen and B. Yu, The Minimum Description Length principle in coding and modeling, IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743-2760, 1998
[BTW08] F. Bunea, A. Tsybakov and M. Wegkamp, Sparse density estimation with l1 penalties, in Proceedings of 20th Annual Conference on Learning Theory (COLT), 2007
[CL06] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning and Games, Cambridge University Press, 2006
[CT06] T.M. Cover and J.A. Thomas, Elements of Information Theory, 2nd ed., Wiley, 2006
[GGO06] M.R. Gupta, R.M. Gray and R.A. Olshen, Nonparametric supervised learning by linear interpolation with maximum entropy, IEEE Transactions on Machine Intelligence, vol. 28, no. 5, pp. 766-781, 2006
[GH01] C. Gentile and D.P. Helmbold, Improved lower bounds for learning from noisy examples: an information-theoretic approach, Information and Computation, vol. 166, pp. 133-155, 2001
[Grü06] P.D. Grünwald, The Minimum Description Length Principle, MIT Press, 2006
[HT07] P. Harremoës and N. Tishby, The information bottleneck revisited or how to choose a good distortion measure, in Proceedings of the IEEE International Symposium on Information Theory, Nice, France, 2007
[HO97] D. Haussler and M. Opper, Mutual information, metric entropy and cumulative entropy risk, The Annals of Statistics, vol. 25, no. 6, pp. 2451-2492, 1997
[Kul68] S. Kullback, Information Theory and Statistics, Dover, 1968
[LR09] S. Lazebnik and M. Raginsky, Supervised learning of quantizer codebooks by information loss minimization, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 7, pp. 1294-1309, 2009
[Lin01] T. Linder, Learning-theoretic methods in vector quantization, in Principles of Nonparametric Learning (ed. by L. Györfi), Springer, 2001
[LLZ94] T. Linder, G. Lugosi and K. Zeger, Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding, IEEE Transactions on Information Theory, vol. 40, no. 6, pp. 1728-1740, 1994
[MF98] N. Merhav and M. Feder, Universal prediction, IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124-2147, 1998
[PKP06] J.B. Predd, S.R. Kulkarni and H.V. Poor, Consistency in models for distributed learning under communication constraints, IEEE Transactions on Information Theory, vol. 52, no. 1, pp. 52-63, 2006
[Rag08] M. Raginsky, Joint fixed-rate universal lossy coding and identification of continuous-alphabet memoryless sources, IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3059-3077, 2008
[Rag09] M. Raginsky, Joint universa lossy coding and identification of stationary mixing sources with general alphabets, IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 1945-1960, 2009
[Rag09a] M. Raginsky, Achievability results for statistical learning under communication constraints, in Proceedings of the IEEE International Symposium on Information Theory, Seoul, South Korea, 2009
[Ris86] J. Rissanen, Stochastic complexity and modeling, The Annals of Statistics, vol. 14, no. 3, pp. 1080-1100, 1986
[SK08] K. Sridharan and S.M. Kakade, An information theoretic framework for multi-view learning, in Proceedings of 21st Annual Conference on Learning Theory (COLT), 2008
[SKF08] M.W. Seeger, S.M. Kakade and D.P. Foster, Information consistency of nonparametric Gaussian process methods, IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 2376-2382, 2008
[TPB99] N. Tishby, F. Pereira and W. Bialek, The information bottleneck method, in Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, 1999
[WJ08] M.J. Wainwright and M.I. Jordan, Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning, vol. 1, nos. 1-2, pp. 1-305, 2008
[WO08] M.B. Westover and J.A. O'Sullivan, Achievable rates for pattern recognition, IEEE Transactions on Information Theory, vol. 54, no. 1, pp. 299-320, 2008
[Zha06] T. Zhang, Information-theoretic upper and lower bounds for statistical estimation, IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1307-1321, 2006
Comments (4)
  • daxue001  - dfdf
    The National Cancer Institute’s Clinical Trials Cooperative Group Program — which coordinates and facilitates trials of new cancer treatments — is Authentic Christian Louboutin Sandals“at a breaking point” and needs an “overhaul,” says a report released today by the Institute of Medicine.

    The IOM recommends that the clinical trials program — which includes more than 3,100 institutions — consolidate some administrative functions now handled by its 10 individual cooperative research groups and pay clinicians for designing andReplica Christian Louboutin Sandals implementing the trials. On the science side, it says the program should mandate the submission of tissue and blood samples from study participants to centralYouth Christian Louboutin Sandals repositories and that any resulting data should have no intellectual property restrictions. To encourage patient participation,Official Christian Louboutin Sandals the IOM calls on insurers to pay for the non-experimental portion of stud...
  • daxue001
    Under the healthcare law passed in March, insurers must adjust their spending habits to meet new requirements. For example, large group plans must spend at least 85 cents of every premium dollar paid to them on actual medical care as opposed to administrative costs,China Christian Louboutin Pumps while individual and small group plans must spend 80 cents.

    Wall Street closely watches such spending levels, known as medical-loss ratios, or MLRs, as a sign of potential profits. Major health insurance stock indexes fell after the report.

    "The insurance industry Discount Christian Louboutin Pumpsis beginning to consider the financial impact of the new federally required (medical) loss ratio requirements, including questionable changes in their accounting practices," the Democratic-led Senate Committee on Commerce, Science and Transportation said in a statement.

    For example, WellPoint Inc "has already 'reclassified' more than half a billion dollarsCheap Christian Loubo...
  • skinnymaggie  - silver tiffany
    All trades, arts, and handiworks have gained by division of labour, namely, when, instead of one man doing everything, each confines himself to a certain kind of work distinct from others in the treatment it requires,Silver Tiffany so as to be able to perform it with greater facility and in the greatest perfection. Tiffany JewelryWhere the different kinds of work are not distinguished and divided, where everyone is a jack-of-all-trades, Tiffany Braceletsthere manufactures remain still in the greatest barbarism. Tiffany NecklacesIt might deserve to be considered whether pure philosophy in all its parts Tiffany Earringsdoes not require a man specially devoted to it, and whether it would not be better for the whole business of science if those who, to please the tastes of the public, are wont to blend the rational and empirical elements together, mixed in all sorts of proportions unknown to themselves, and who call themselves independent thinkers, giving the name of minute philosophers...
Only registered users can write comments!

!joomlacomment 4.0 Copyright (C) 2009 Compojoom.com . All rights reserved."

Last Updated on Saturday, 01 August 2009 19:05