The Use of Randomization and Statistical Significance in Data Mining

This video was recorded at Practical Theories for Exploratory Data Mining (PTDM), Brussels 2012. The concept and theory of statistical significance testing are well established in the traditional setting, but not in the problem settings that arise in data mining. In this talk I discuss the formulation, advantages, and limitations of statistical significance testing approaches in data mining. A data mining problem whose objective is to find patterns, such as frequent sets or clusterings, can be formulated as a statistical significance testing problem if one can (i) define a null hypothesis, (ii) formulate one or more reasonable test statistics, and (iii) either map patterns to constraints on the null hypothesis or assign each pattern a test statistic of its own, the latter case resulting in a multiple hypothesis testing setup. Data mining problems rarely allow an analytic formulation; hence, randomization methods, which are needed to sample from the null hypothesis, are a key ingredient. The significance testing approach can be viewed as a way to solve the regularization problem in data analysis: one can use relatively simple and efficient data mining algorithms and rely on significance testing to take the noise into account in a principled manner. Significance testing is especially suitable for exploratory data analysis because it is possible, on the one hand, to use constraints on the null hypothesis to incorporate what we already know about the data and, on the other hand, to find the patterns that tell us the most about what we do not yet know. We present formulations of the exploration problem along with some theoretical results.
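The three ingredients named in the abstract (a null hypothesis, a test statistic, and randomization to sample the null) can be illustrated with a minimal sketch. Everything in it is an assumption chosen for illustration and is not taken from the talk: the data are a 0/1 transaction matrix, the pattern is an itemset, the test statistic is the itemset's support, and the null hypothesis is that items occur independently with their observed frequencies, sampled by permuting each column on its own. The function names `support`, `shuffle_columns`, and `empirical_p_value` are hypothetical.

```python
import random


def support(rows, items):
    """Test statistic: number of rows that contain every item in `items`."""
    return sum(all(row[i] for i in items) for row in rows)


def shuffle_columns(rows, rng):
    """Draw one sample from the assumed null hypothesis.

    Each column is permuted independently, which preserves item
    frequencies but destroys any co-occurrence between items.
    """
    cols = [list(col) for col in zip(*rows)]
    for col in cols:
        rng.shuffle(col)
    return [list(row) for row in zip(*cols)]


def empirical_p_value(rows, items, n_samples=999, seed=0):
    """Fraction of null samples whose support is at least the observed one.

    The +1 terms count the observed data as one of the samples, so the
    returned value is never exactly zero.
    """
    rng = random.Random(seed)
    observed = support(rows, items)
    at_least = sum(
        support(shuffle_columns(rows, rng), items) >= observed
        for _ in range(n_samples)
    )
    return (at_least + 1) / (n_samples + 1)
```

Calling `empirical_p_value(rows, [0, 1])` on a matrix whose first two columns always occur together should give a small p-value, while a single-item pattern such as `[0]` gives a p-value of 1 under this null, because column permutation never changes an individual item's frequency. If many itemsets are tested at once, the resulting p-values would still need a multiple hypothesis testing correction, which this sketch omits.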

