Material Detail

Large Scale Learning with String Kernels

This video was recorded at NIPS Workshop on Efficient Machine Learning, Whistler 2007. In applications of bioinformatics and text processing, such as splice site recognition and spam detection, large amounts of training sequences are available and needed to achieve sufficiently high prediction performance on classification or regression tasks. Although kernel-based methods such as SVMs often achieve state-of-the-art results, training and evaluation times may be prohibitively large. When single kernel computation time is already linear (w.r.t. the input sequences) it seems difficult to achieve further speed ups. In this work we describe an efficient technique for computing linear combinations of string kernels using sparse data structures such as explicit maps, sorted arrays and suffix tries, trees or arrays [5]. As computing linear combinations of kernels make up the dominant part of SVM training and evaluation, speeding up their computation is essential. Considering the recently proposed and successfully used linear time string kernels, like the Spectrum kernel [2] and the Weighted Spectrum kernel [3] we show that one can accelerate SVM training by factors of 7 and 60 times, respectively, while requiring considerably less memory. Our method allows us to train string kernel SVMs on sets as large as 10 million sequences [4]. Moreover, using these techniques the evaluation on new sequences is often several thousand times faster, allowing us to apply the classifiers on genome-sized data sets with seven billion test examples [6]. The presented algorithms are implemented in our Machine Learning toolbox SHOGUN for which the source code is publicly available at http://www.shogun-toolbox.org. References: [1] D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge University Press, 1997. [2] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, Kaua'i, Hawaii, 2002. [3] G. R¨atsch and S. Sonnenburg. Accurate Splice Site Prediction for Caenorhabditis Elegans, pages 277–298. MIT Press series on Computational Molecular Biology. MIT Press, 2004. [4] S. Sonnenburg, P. Philips, G. Schweikert, and G. R¨atsch. Accurate splice site prediction using support vector machines. BMC Bioinformatics, 8, 2007. [5] S. Sonnenburg, G. R¨atsch, and K. Rieck. Large-scale learning with string kernels. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, Neural Information Processing Series, pages 73–104. MIT Press, Cambridge, MA, 2007. [6] S. Sonnenburg, A. Zien, and G. R¨atsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.

Keywords:: videolectures, ocwc, oec

Disciplines:

Science and Technology / Computer Science / Programming & Programming Languages

Go to Material

Bookmark / Add to Course ePortfolio

Create a Learning Exercise

Add Accessibility Information

Rate

Add a Comment

Quality

User Rating
Comments
Learning Exercises
Bookmark Collections
Course ePortfolios
Accessibility Info

Report Broken Link
Report as Inappropriate

More about this material

Material Type:: Presentation
Date Added to MERLOT:: February 8, 2015
Date Modified in MERLOT:: February 8, 2015
Author:: Sören Sonnenburg, Machine Learning and Intelligent Data Analysis Group, TU Berlin
Submitter:: The Open Education Consortium
Primary Audience:: College General Ed, College Lower Division, College Upper Division
Technical Format:: Video

Mobile Compatibility:: Not specified at this time
Language:: English
Cost Involved:: No
Source Code Available:: No
Creative Commons:: This work is licensed under a Attribution-NonCommercial-NoDerivs 3.0 United States