This video was recorded at Machine Learning Summer School (MLSS), Tübingen 2003.

These are lectures on some fundamental mathematics underlying many approaches and algorithms in machine learning. They are not about particular learning algorithms; they are about the basic concepts and tools upon which such algorithms are built. Often students feel intimidated by such material: there is a vast amount of "classical mathematics", and it can be hard to find the wood for the trees. The main topics of these lectures are Lagrange multipliers, functional analysis, some notes on matrix analysis, and convex optimization. I've concentrated on things that are often not dwelt on in typical CS coursework. Lots of examples are given; if it's green, it's a puzzle for the student to think about. These lectures are far from complete: perhaps the most significant omissions are probability theory, statistics for learning, information theory, and graph theory. I hope eventually to turn all this into a series of short tutorials. Please let me know of any errors, etc.

(From Chris Burges' homepage: http://research.microsoft.com/~cburges)

Lecture contains:

Lagrange multipliers:
* Lagrange the Mathematician
* Lagrange multipliers: an indirect approach can be easier
* Multiple Equality Constraints
* Multiple Inequality Constraints
* Two points on a d-sphere
* The Largest Parallelogram
* Resource allocation
* A convex combination of numbers is maximized by choosing the largest
* The Isoperimetric problem
* For fixed mean and variance, which univariate distribution has maximum entropy?
* An exact solution for an SVM living on a simplex

Notes on some Basic Statistics
* Probabilities can be Counter-Intuitive (Simpson's paradox; the Monty Hall puzzle)
* IID-ness: Measurement Error decreases as 1/sqrt(n)
* Correlation versus Independence
* The Ubiquitous Gaussian:
  * Product of Gaussians is Gaussian
  * Convolution of two Gaussians is a Gaussian
  * Projection of a Gaussian is a Gaussian
  * Sum of Gaussian random variables is a Gaussian random variable
  * Uncorrelated Gaussian variables are also independent
  * Maximum Likelihood Estimates for mean and covariance (prove required matrix identities)
  * Aside: For 1-dim Laplacian, max. likelihood gives the median
* Using cumulative distributions to derive densities

Principal Component Analysis and Generalizations
* Ordering by Variance
* Does Grouping Change Things?
* PCA Decorrelates the Samples
* PCA gives Reconstruction with Minimal Mean Squared Error
* PCA preserves Mutual Information on Gaussian data
* PCA directions lie in the span of the data
* PCA: second order moments only
* The Generalized Rayleigh Quotient
  * Non-orthogonal principal directions
  * OPCA
  * Fisher Linear Discriminant
  * Multiple Discriminant Analysis

Elements of Functional Analysis
* High Dimensional Spaces
* Is Winning Transitive?
* Most of the Volume is Near the Surface: Cubes
* Spheres in n-dimensions
* Banach Spaces, Hilbert Spaces, Compactness
* Norms
* Useful Inequalities (Minkowski and Hölder)
* Vector Norms
* Matrix Norms
* The Hamming Norm
* L1, L2, L_infinity norms - is L0 a norm?
* Example: Using a Norm as a Constraint in Kernel Algorithms
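
Two items in the outline above make quantitative claims that are easy to check numerically: "Most of the Volume is Near the Surface: Cubes" and "Measurement Error decreases as 1/sqrt(n)". The following is a minimal sketch, not taken from the lectures. The function names, the shell thickness of 0.05, and the choice of standard normal samples are all assumptions made for illustration.

```python
import random
import statistics


def cube_shell_fraction(d, eps):
    """Fraction of the unit cube [0, 1]^d lying within eps of its boundary.

    Assumes 0 <= eps <= 0.5. The inner cube has side 1 - 2*eps, so its
    volume is (1 - 2*eps)**d; the rest of the volume is the "shell".
    """
    return 1.0 - (1.0 - 2.0 * eps) ** d


def std_of_sample_mean(n, trials=1000, sigma=1.0, seed=0):
    """Empirical standard deviation of the mean of n IID N(0, sigma^2) draws.

    Illustrative helper: repeats the experiment `trials` times and measures
    the spread of the resulting sample means.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    # Shell fraction grows toward 1 as the dimension d increases.
    for d in (1, 10, 100):
        print("d =", d, "shell fraction =", cube_shell_fraction(d, 0.05))

    # Quadrupling n should roughly halve the spread of the sample mean.
    for n in (100, 400):
        print("n =", n, "empirical =", std_of_sample_mean(n),
              "sigma/sqrt(n) =", 1.0 / n ** 0.5)
```

With a fixed shell thickness, the shell fraction approaches 1 as the dimension grows, and the empirical spread of the sample mean should track sigma/sqrt(n) up to simulation noise.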