Mining Massive Data Sets

This video was recorded at the NATO Advanced Study Institute on Mining Massive Data Sets for Security. Today, the amount of data coming from all possible sources is enormous and growing fast, due in large part to the ubiquitous Web and its increasing presence in everyday life, but also to email, cell phones, credit cards, retail, finance, and more. These data serve all sorts of functions, from query and search to extracting information, providing services, and managing security. Many fields are involved: statistics, data mining, text mining, data streams, search, social networks, and others. There is no lack of sophisticated techniques produced by academic activity, where the challenges mostly concern the novelty, accuracy, and scalability of algorithms. In real-world applications, however, the challenges are quite different: scalability (usually one or two orders of magnitude larger than in academic publications), ease of use, and the ability to integrate efficient techniques into working systems in a transparent way, while always producing value for the customer. Real-world solutions are complex and usually need to integrate many technical components from the various fields mentioned above, so it becomes important to assess how these fields can complement one another.

In the first part of the talk, I will present the challenges of real-world data mining applications. I will introduce the general Statistical Learning Theory framework and discuss some of the technical issues involved (high-dimensional data sets, missing data, outliers, non-i.i.d. structured data, unlabelled data, and so on). In the second part, I will show, using examples from the implementation in KXEN and from the applications developed, how a theoretical framework (Structural Risk Minimization [1]) can be used to solve some of the challenges met in the real world. Finally, I will describe some open practical issues that will require further theoretical investigation.

