6
$\begingroup$

There are many examples in which making more "precise" predictions gives worse performance (e.g. Runge's phenomenon). My professor implied that there was a sound basis for choosing "simple" functions over complex ones in the general case, and that it had to do with information theory.

Does anyone know what he was referring to?

As an example: consider least squares. Obviously we could find a polynomial of very high degree with zero error, but we prefer a linear equation with higher error. Why should this be?

(I am familiar with some basic notions like entropy, but not much more than that, so simpler explanations would be much preferred. Although I understand that if it's complex, it's complex.)

  • 0
    For doing what? – 2010-12-21
  • 0
    @Qiaochu: The specific example was interpolation by a polynomial, and why splines are preferred to interpolating at a large number of points. I understood him to mean it was a general phenomenon, though. – 2010-12-21
  • 1
    The problem with high-degree polynomials is that they tend to wiggle. A lot. – 2010-12-21
  • 1
    Also, the adage "high order does not imply high accuracy" applies as well. – 2010-12-21
  • 0
    This might be helpful: http://isites.harvard.edu/fs/docs/icb.topic539621.files/lec7.pdf – 2010-12-21

2 Answers

4

Look up VC theory (named after Vapnik and Chervonenkis), and especially Structural Risk Minimization (SRM).
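A toy sketch of the SRM idea (the data, the degree range, and the simple `(deg + 1) / n` penalty are illustrative stand-ins, not Vapnik's actual VC-dimension bound): among a nested sequence of model classes, pick the one minimizing empirical risk plus a complexity term.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(scale=0.1, size=30)  # true relation is linear

# SRM in spirit: nested classes = polynomial degrees 0, 1, 2, ...
# Score each class by training error plus a penalty that grows
# with the class's capacity (a crude proxy for the VC bound).
n = len(x)
best_deg, best_score = None, float("inf")
for deg in range(10):
    coeffs = np.polyfit(x, y, deg)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    penalty = (deg + 1) / n  # more parameters -> bigger penalty
    score = train_mse + penalty
    if score < best_score:
        best_deg, best_score = deg, score
print("selected degree:", best_deg)
```

Raising the degree past the true one barely reduces training error, so the penalty dominates and the simple (linear) class is selected.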

  • 1
    http://arxiv.org/pdf/nlin/0307015 for instance – 2011-05-18
1

There's a fundamental tradeoff in inference between simplicity and accuracy. More accurate predictions require greater complexity, and complex models are liable to 'overfit': they learn something from a small sample that isn't actually there in the underlying distribution itself (and that would wash out in a large sample).

In information theory, the tradeoff is often framed in terms of entropy: we choose the distribution that maximizes entropy, subject to whatever constraints we are willing to impose (usually these come from the data). For example, suppose you are flipping a coin and trying to decide whether it is fair. The first flip is heads. Should you conclude at this point that the coin is biased to always land heads? Of course not! That would be major overfitting. Now, if you flipped it 10 times and it came up heads every time, that would qualify as 'more constraints', allowing you to narrow down the true distribution (maybe it's 90% heads, 10% tails).
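The coin example can be made concrete (the candidate biases below are just illustrative): with no constraints, the fair coin maximizes entropy at 1 bit, while the "always heads" conclusion has entropy 0, the most extreme possible commitment.

```python
import math

def entropy(p):
    # Shannon entropy in bits of a coin that lands heads with probability p.
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

# Fair coin: 1.0 bit (the maximum); deterministic coin: 0.0 bits.
for p in (0.5, 0.6, 0.9, 1.0):
    print(p, round(entropy(p), 3))
```

The maximum-entropy principle says: absent evidence (constraints) to the contrary, prefer the 0.5 coin; data accumulating toward 90% heads is what licenses moving to the lower-entropy, more committed distribution.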