

Strictly speaking, data snooping is not the same thing as in-sample vs. out-of-sample model selection and testing; rather, it has to do with sequential or multiple tests of hypothesis based on the same data set.

Suppose that you have a time series of returns for a single asset and a large number of candidate model families. You fit each of these models on a training data set, and then check the performance of each model's predictions on a hold-out sample. Each model may itself have been fitted using cross-validation on the training set, or other in-sample criteria such as AIC, BIC, or Mallows' Cp. The problem is that, implicitly, multiple tests of hypothesis are being run on the same data at the same time: if the number of models is high enough, there is a non-negligible probability that the predictions provided by at least one model will be judged good purely by chance. This has nothing to do with bias-variance trade-offs.

Intuitively, the criterion used to evaluate multiple models should be more stringent than the one used for a single model, and a naive approach would be to apply a Bonferroni correction. For examples of a typical protocol and criteria, check Ch. 7 of Hastie, Tibshirani, and Friedman's "The Elements of Statistical Learning". For a technical discussion, see lecture 9 of Andrew Ng's machine learning class at Stanford.
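To make the multiple-testing point concrete, here is a minimal simulation sketch (my own illustration, not from any of the references above): we generate many candidate "models" whose predicted returns are pure noise, so none has any true edge, and count how many pass a naive 5% significance test versus a Bonferroni-corrected one on the same data set. The model count, sample size, and seed are arbitrary choices for the demonstration.

```python
import statistics
import numpy as np

rng = np.random.default_rng(0)
n_models = 1000   # number of candidate models evaluated on the same data
n_obs = 252       # roughly one year of daily returns
alpha = 0.05

# Every candidate has zero true edge: its realized returns are pure noise.
returns = rng.standard_normal((n_models, n_obs))

# t-statistic of the mean return for each model (test of "mean return = 0").
t_stats = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / np.sqrt(n_obs))

# Two-sided critical values (normal approximation, adequate for n_obs = 252):
# naive per-model threshold vs. Bonferroni-corrected threshold alpha / n_models.
z = statistics.NormalDist()
naive_crit = z.inv_cdf(1 - alpha / 2)                 # about 1.96
bonferroni_crit = z.inv_cdf(1 - alpha / (2 * n_models))  # about 4.06

n_naive = int((np.abs(t_stats) > naive_crit).sum())
n_bonferroni = int((np.abs(t_stats) > bonferroni_crit).sum())

# Roughly 5% of the null models pass the naive test by chance alone,
# while the Bonferroni-corrected test rejects almost none of them.
print(f"models passing naive {alpha:.0%} test: {n_naive}")
print(f"models passing Bonferroni-corrected test: {n_bonferroni}")
```

The Bonferroni correction is deliberately conservative; the chapter of Hastie, Tibshirani, and Friedman cited above discusses less blunt alternatives (e.g. false-discovery-rate control) for the same problem.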
