Research Scientist
Yahoo! Research
Haifa, Isreal
edo.liberty@ymail.com
עידו ליברטי
homepage
Data Mining is concerned with efficiently extracting statistics, patterns, structures, or meanings from raw data. This task becomes hard when the amount of data is large, which is often the case in modern data sets. This course will survey modern algorithms, concepts, and data structures common to data mining in large data sets. We will try to cover, among other topics: data sampling, finding causal relations and frequent item sets, counting in data streams, ranking and sorting, approximating large matrix operations, dimensionality reduction and efficient searching in high dimensions. We will also discuss modern cluster architectures and computational models.
I recommend that students be familiar with probability theory, basic combinatorics, linear algebra, basic complexity theory, and traditional data structures, at least on at introductory level. The class will attempt to be self contained nonetheless.
Undergraduate students:
http://groups.google.com/group/algorithms-in-data-mining-tau
Class notes will be posted here as the sememster progresses.
Lesson 1: Introduction
presentation.
Lesson 1:
Probability recap class notes.
Lesson 1: Mark
and recapture class notes.
Lesson 2: Sampling class
notes.
Lesson 2: Chernoff
bound class notes by John Canny.
Lesson 3:
Frequent items in streams class notes.
Assignment 1: Size of
Trees and Approximate Histograms. Due Monday Nov 28th
Lesson 4: Frequency
Moments class notes.
Lesson 5: Random
Projections class notes.
Lesson 6: Singular
Value Decomposition.
Assignment 2: Weak
Random Projection. Due Monday Dec 26th. Solutions
Lesson 7: Matrix
Sampling and efficient rank-k approximation
Lesson 7: Roman
Versynin's proof of the matrix Chernoff bound by Ahlswede and
Winter
Lesson 8: Metric
Embedding into random trees (Notes from a Metric Embedding class
given by the late Avner Magen)
Lesson 9: Nearest
neighbor search, and kd-trees
Lesson 10: Approximate
nearest neighbor search, Locallity Sensitive Hashing
Lesson 11: k-means
clustering
Assignment 2: Meta-algorithms
and the power method. Due Monday Jan 30th
Lesson 12: Very
short intro to spectral graph theory (partial and temporary
class notes). Based on class notes by
Dan Spielman.
Lesson 13: Review of last year's exam and answers "Moed Alef".