Edo Liberty Ph.D
Research Scientist
Yahoo! Research
Haifa, Isreal
edo.liberty@ymail.com
עידו ליברטי
homepage
Class description:
Data Mining is concerned with efficiently extracting statistics,
patterns, structures, or meanings from raw data. This task becomes
hard when the amount of data is large, which is often the case in
modern data sets. This course will survey modern algorithms,
concepts, and data structures common to data mining in large data
sets. We will try to cover, among other topics: data sampling,
finding causal relations and frequent item sets, counting in data
streams, ranking and sorting, approximating large matrix
operations, dimensionality reduction and efficient searching in
high dimensions. We will also discuss modern cluster architectures
and computational models.
I recommend that students be familiar with probability theory,
basic combinatorics, linear algebra, basic complexity theory, and
traditional data structures, at least on at introductory level.
The class will attempt to be self contained nonetheless.
Some details...
The class takes place 9:00 to 12:00 in room 112 of the Dan David
building. (We are no longer in Dan David 204)
Undergraduate students:
- Final exam (50% of the grade).
- Home assignments (50% of the grade).
- A project (for extra credit) is optional for students who are
interested.
Master and PhD students:
- No final exam.
- Home assignments (50% of the grade).
- Final project (50% of the grade). Should require roughly a
week's worth of work, comparable to learning for and taking the
final exam. These project can contain both theoretical and
experimental elements.
Class notes
Lesson 1: Why data mining? Presentation.
Lesson 1: Probability recap. Class
notes.
Lesson 1: Mark and recapture. Class notes.
Lesson 2: Sampling. Class notes.
Lesson 2: The Chernoff bound. Class notes by John Canny)
Homework 1: i.i.d. sampling from streams. Assignment, solution.
Lesson 3: Dinamic Item Set Counting. Class notes.
Lesson 3: A Simple Algorithm for Finding Frequent Elements in
Streams and Bags
Class notes.
Lesson 4: Finding Frequent Items in Data Streams. Class
notes.
Lesson 4: Notes on Bloom filters vs. bit hashes. Class notes.
Lesson 4: Notes on Count Sketches. Class notes.
Lesson 5: Approximating the frequency moments. Class notes.
Homework 2: Approximate set unions. Assignment,
solution.
Lesson 6: Random Projections. Class notes.
Lesson 7: Fast Random Projections
Lesson 8: Singular Value Decomposition (SVD) Class notes.
Lesson 9: the power method.
Class notes.
Lesson 9: matrix sampling by Ahlswede and Winter Paper.
Lesson 10: Nearest Neighbor search, KD-trees and other space
partitioning solutions.
Lesson 11: Approximate Nearest Neighbor search, Local Sensitive
Hashing class notes.
Homework 3: Document duplicate detection Assignment.
Lesson 12: Searching, Inverted indexes and the set Containment
problem. class notes.
Lesson 13: Material review. Example test questions.
Lesson 14: Introduction to spectral graph theory Class notes 1, class notes 2, and
class notes 3. All
by Daniel A. Spielman.
Exam 1: "Moed Alef"
Exam 1: "Moed Alef"
answers.
Exam 2: "Moed Beit"
Exam 2: "Moed Beit"
answers (comming soon)