Material Detail

The Use of Randomization and Statistical Significance in Data Mining

This video was recorded at Practical Theories for Exploratory Data Mining (PTDM), Brussels 2012. The concept and theory of statistical significance testing are well established in the traditional setting, but not in the problem settings related to data mining. In this talk I discuss the formulation, as well as the advantages and limitations, of statistical significance testing approaches in data mining. A data mining problem where the objective is to find patterns such as frequent sets or clusterings can be formulated as a statistical significance testing problem if one can (i) define a null hypothesis, (ii) formulate one or more reasonable test statistics, and (iii) either map patterns to constraints on the null hypothesis or assign each pattern a test statistic of its own, the latter case resulting in a multiple hypothesis testing setup. Data mining problems rarely allow an analytic formulation; hence, randomization methods, which are needed to sample from the null hypothesis, are a key ingredient. The significance testing approach can be viewed as a way to solve the regularization problem in data analysis: it is possible to use relatively simple and efficient data mining algorithms and use significance testing to take the noise into account in a principled manner. Significance testing is especially suitable for exploratory data analysis because it is possible, on the one hand, to use constraints on the null hypothesis to incorporate what we know about the data and, on the other hand, to find patterns that tell the most about what we do not yet know about the data. We present formulations of the exploration problem together with some theoretical results.
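The three ingredients named in the abstract (a null hypothesis, a test statistic, and a randomization method that samples from the null) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the function names are hypothetical, the data are binary transactions, the test statistic is the co-occurrence count of two items, and the null model shuffles each column independently, which keeps item frequencies fixed but nothing else. Other null models, such as swap randomization that also preserves row sums, would encode more of what is already known about the data.

```python
import random


def support(rows, a, b):
    """Test statistic: number of rows in which items a and b both occur."""
    return sum(1 for r in rows if r[a] and r[b])


def permute_columns(rows, rng):
    """Draw one sample from the assumed null model.

    Each column is shuffled independently, so every item keeps its
    frequency while any association between items is destroyed.
    """
    cols = [list(c) for c in zip(*rows)]
    for c in cols:
        rng.shuffle(c)
    return [list(r) for r in zip(*cols)]


def empirical_p_value(rows, a, b, n_samples=999, seed=0):
    """One-sided empirical p-value for the itemset {a, b}.

    Counts how often a randomized data set gives a support at least as
    large as the observed one; the +1 terms count the observed data as
    one of the samples, so the p-value is never exactly zero.
    """
    rng = random.Random(seed)
    observed = support(rows, a, b)
    hits = sum(
        1
        for _ in range(n_samples)
        if support(permute_columns(rows, rng), a, b) >= observed
    )
    return (hits + 1) / (n_samples + 1)


if __name__ == "__main__":
    # Toy data: items 0 and 1 always occur together, item 2 never with them.
    data = [[1, 1, 0]] * 10 + [[0, 0, 1]] * 10
    print("support of {0, 1}:", support(data, 0, 1))
    print("empirical p-value:", empirical_p_value(data, 0, 1))
```

On the toy data the itemset {0, 1} should receive a small p-value, because independent column shuffles almost never reproduce perfect co-occurrence. Computing one such p-value per candidate pattern leads to the multiple hypothesis testing setup mentioned in the abstract, where the p-values would need a correction before being compared against a significance level.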

Keywords:: videolectures, ocwc, oec

Disciplines:

Science and Technology / Computer Science / Programming & Programming Languages

More about this material

Material Type:: Presentation
Date Added to MERLOT:: February 10, 2015
Date Modified in MERLOT:: February 10, 2015
Author:: Kai Puolamäki, Finnish Institute of Occupational Health
Submitter:: The Open Education Consortium
Primary Audience:: College General Ed, College Lower Division, College Upper Division
Technical Format:: Video

Mobile Compatibility:: Not specified at this time
Language:: English
Cost Involved:: No
Source Code Available:: No
Creative Commons:: This work is licensed under an Attribution-NonCommercial-NoDerivs 3.0 United States license