Title: (Preventing) Overfitting in Adaptive Data Analysis

Speaker: Jon Ullman

Abstract: How can we use a random sample to draw meaningful conclusions about a population without overfitting to the sample? This problem remains vexing despite decades of study by statisticians. One cause of overfitting is so-called adaptive data analysis---the common practice where datasets are re-used for multiple analyses, and the outcome of one analysis can influence the choice of future analyses. Adaptivity can arise in a range of ways, including data shared across research groups, multi-stage statistical methods, and data science competitions with shared validation data. Unfortunately, adaptivity invalidates traditional methods for preventing overfitting, because it breaks the fundamental assumption that the analysis is chosen independently of the data it is run on.
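
To make the failure concrete, here is a small simulation of that scenario (my own illustration, not material from the talk). The data are pure noise, so no analysis should find anything, yet an analyst who selects features using the sample and then evaluates the resulting classifier on the same sample sees impressive accuracy. The sample size, number of features, selection threshold, and use of numpy are arbitrary choices.

```python
# Illustrative simulation (not from the talk): the data contain no signal at
# all, yet adaptive reuse of one sample produces an apparently strong finding.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 1000                      # sample size and number of candidate features
X = rng.standard_normal((n, d))       # features: pure noise
y = rng.choice([-1.0, 1.0], size=n)   # labels: independent of X by construction

# Analysis 1: compute each feature's empirical correlation with y on the
# sample and keep the features that look promising (roughly a 2-sigma cutoff).
corr = X.T @ y / n
selected = np.abs(corr) > 2 / np.sqrt(n)

# Analysis 2 (adaptive, since it depends on Analysis 1): combine the selected
# features into a classifier and evaluate it on the SAME sample.
w = np.sign(corr) * selected
in_sample_acc = np.mean(np.sign(X @ w) == y)

# Honest evaluation on a fresh sample from the same (signal-free) population.
X_new = rng.standard_normal((n, d))
y_new = rng.choice([-1.0, 1.0], size=n)
fresh_acc = np.mean(np.sign(X_new @ w) == y_new)

print(f"features selected:     {selected.sum()}")
print(f"in-sample accuracy:    {in_sample_acc:.2f}")   # well above 0.5: overfit
print(f"fresh-sample accuracy: {fresh_acc:.2f}")       # about 0.5: no real signal
```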

In this talk I will present a relatively recent approach to understanding and preventing false discovery in adaptive data analysis, based on the concept of algorithmic stability. Specifically, I will introduce and motivate the problem of adaptive data analysis, describe a model for studying it, and demonstrate how overfitting can occur in this model and, to some extent, be mitigated.
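
For a rough sense of how stability-based mitigations operate, the sketch below reruns the same adaptive analysis, but the analyst only sees answers perturbed with Laplace noise. This is a simplified illustration in the spirit of noise-addition mechanisms from this line of work, not the specific construction discussed in the talk; the noise scale and selection threshold are arbitrary choices.

```python
# Simplified stability sketch (my own illustration, not the talk's mechanism):
# the analyst never sees raw sample statistics, only noisy answers, which
# masks fluctuations that are specific to this particular sample.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 1000
X = rng.standard_normal((n, d))       # still pure noise
y = rng.choice([-1.0, 1.0], size=n)   # still no true signal

def noisy_answer(stat, scale):
    """Release a statistic computed on the sample, plus Laplace noise."""
    return stat + rng.laplace(0.0, scale, size=np.shape(stat))

# Same adaptive analyst as before, now working through the noisy interface.
# The noise scale is a few times the ~1/sqrt(n) sampling fluctuation, and the
# selection threshold is raised accordingly; both are illustrative choices.
corr = noisy_answer(X.T @ y / n, scale=3 / np.sqrt(n))
selected = np.abs(corr) > 6 / np.sqrt(n)
w = np.sign(corr) * selected
in_sample_acc = np.mean(np.sign(X @ w) == y)

print(f"features selected:  {selected.sum()}")
print(f"in-sample accuracy: {in_sample_acc:.2f}")   # apparent gain over 0.5 shrinks
                                                    # compared to the unprotected run
```

The price is that every answer the analyst receives is less accurate; how much noise, or more generally how much stability, is needed to support a given number of adaptive analyses is the kind of tradeoff this approach is designed to quantify.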